Preface


Over the past decades great importance has been placed on stochastic calculus
and processes in mathematics, finance, and econometrics. This book is addressed
particularly to readers from these fields, although students of other subjects such as
biology, engineering, or physics may find it useful, too.


Scope  of  the  Book 

By now there exist a number of books describing stochastic integrals and stochastic
calculus in an accessible manner. Such introductory books, however, typically
address an audience having previous knowledge about and interest in one of the
following three fields exclusively: finance, econometrics, or mathematics. The
textbook at hand attempts to provide an introduction to stochastic calculus and
processes for students from each of these fields. Obviously, this can on no account
be an exhaustive treatment. In the next chapter a survey of the topics covered
is given. In particular, the book deals neither with finance theory nor with
statistical methods from the time series econometrician’s toolkit; it rather provides
a mathematical background for those readers interested in these fields.

The  first  part  of  this  book  is  dedicated  to  discrete-time  processes  for  modeling 
temporal  dependence  in  time  series.  We  begin  with  some  basic  principles  of 
stochastics  enabling  us  to  define  stochastic  processes  as  families  of  random  variables 
in  general.  We  discuss  models  for  short  memory  (so-called  ARMA  models),  for 
long  memory  (fractional  integration),  and  for  conditional  heteroscedasticity  (so- 
called  ARCH  models)  in  respective  chapters.  One  further  chapter  is  concerned 
with  the  so-called  frequency  domain  or  spectral  analysis  that  is  often  neglected  in 
introductory  books.  Here,  however,  we  propose  an  approach  that  is  not  technically 
too  demanding.  Throughout,  we  restrict  ourselves  to  the  consideration  of  stochastic 
properties  and  interpretation.  The  statistical  issues  of  parameter  estimation,  testing, 
and  model  specification  are  not  addressed  due  to  space  limitations;  instead,  we  refer 
to, e.g., Mills and Markellos (2008), Kirchgässner, Wolters, and Hassler (2013), or
Tsay  (2005). 

The second part contains an introduction to stochastic integration. We start with
elaborations on the Wiener process $W(t)$, as we will define (almost) all integrals in
terms of Wiener processes. In one chapter we consider Riemann integrals of the
form $\int f(t)\,W(t)\,dt$, where $f$ is a deterministic function. In another chapter Stieltjes
integrals are constructed as $\int f(t)\,dW(t)$. More specifically, stochastic integrals as
such result when a stochastic process is integrated with respect to the Wiener
process, e.g., the Ito integral $\int W(t)\,dW(t)$. Solving stochastic differential equations
is one task of stochastic integration, for which we will need Ito’s lemma. Our
description aims at a similar compromise between concreteness and mathematical
rigor as, e.g., Mikosch (1998). If the reader wants to address this matter more
rigorously, we recommend Klebaner (2005) or Øksendal (2003).

The  third  part  of  the  book  applies  previous  results.  The  chapter  on  stochastic 
differential  equations  consists  basically  of  applications  of  Ito’s  lemma.  Concrete 
differential  equations,  as  they  are  used,  e.g.,  when  modeling  interest  rate  dynamics, 
will  be  covered  in  a  separate  chapter.  The  second  area  of  application  concerns 
certain  limiting  distributions  of  time  series  econometrics.  A  separate  chapter  on  the 
asymptotics  of  integrated  processes  covers  weak  convergence  to  Wiener  processes. 
The  final  two  chapters  contain  applications  for  nonstationary  processes  without 
cointegration  on  the  one  hand  and  for  the  analysis  of  cointegrated  processes  on  the 
other.  Further  details  regarding  econometric  application  can  be  found  in  the  books 
by  Banerjee,  Dolado,  Galbraith  and  Hendry  (1993),  Hamilton  (1994),  or  Tanaka 
(1996). 

The exposition in this book is elementary in the sense that knowledge of measure
theory is neither assumed nor used. Consequently, mathematical foundations cannot
be treated rigorously, which is why, e.g., proofs of existence are omitted. Rather, I
had two goals in mind when writing this book. On the one hand, I wanted to give a
basic and illustrative presentation of the relevant topics without many “troublesome”
derivations. On the other hand, in many parts a technically advanced level has
been aimed at: procedures are not only presented in the form of recipes but are to
be understood as far as possible, which means they are to be proven. In order to
meet both requirements jointly, this book is equipped with a lot of challenging
problems at the end of each chapter as well as with the corresponding detailed
solutions. Thus the actual text - augmented with more than 60 basic examples and
45 illustrative figures - is rather easy to read, while a part of the technical arguments
is transferred to the exercise problems and their solutions. This is why there are at
least two possible ways to work with the book. For those who are merely interested
in applying the methods introduced, reading the text is sufficient. However,
for an in-depth knowledge of the theory and its application, the reader necessarily
needs to study the problems and their solutions extensively.


Note  to  Students  and  Instructors 

I have taught the material collected here to master’s students (and diploma students
in the old days) of economics and finance or students of mathematics with a minor
in those fields. From my personal experience I may say that the material presented
here is too vast to be treated in a course comprising 45 contact hours. I used the
textbook at hand for four slightly differing courses corresponding to four slightly
differing routes through the parts of the book. Each of these routes consists of three
stages: time series models, stochastic integration, and applications. After Part I on
time series modeling, the different routes separate.

The  finance  route:  When  teaching  an  audience  with  an  exclusive  interest  in 
finance,  one  may  simply  drop  the  final  three  chapters.  The  second  stage  of  the  course 
then  consists  of  Chaps.  7,  8,  9,  10,  and  11.  This  Part  II  on  stochastic  integration  is 
finally  applied  to  the  solution  of  stochastic  differential  equations  and  interest  rate 
modeling  in  Chaps.  12  and  13,  respectively. 

The mathematics route: There is a slight variant of the finance route for the
mathematically inclined audience with an equal interest in finance or econometrics.
One simply replaces Chap. 13 on interest rate modeling by Chap. 14 on weak
convergence on function spaces, which is relevant for modern time series asymptotics.

The  econometrics  route:  After  Part  I  on  time  series  modeling,  the  students  from 
a  class  on  time  series  econometrics  should  be  exposed  to  Chaps.  7,  8,  9,  and  10  on 
Wiener  processes  and  stochastic  integrals.  The  three  chapters  (Chaps.  11,  12,  and 
13)  on  Ito’s  lemma  and  its  applications  may  be  skipped  to  conclude  the  course 
with  the  last  three  chapters  (Chaps.  14,  15,  and  16)  culminating  in  the  topic  of 
“cointegration.” 

The  nontechnical  route:  Finally,  the  entire  content  of  the  textbook  at  hand  can 
still be covered in one single semester; however, this comes at the cost of omitting
technical  aspects  for  the  most  part.  Each  chapter  contains  a  rather  technical  section 
which  in  principle  can  be  skipped  without  leading  to  a  loss  in  understanding.  When 
omitting  these  potentially  difficult  sections,  it  is  possible  to  go  through  all  the 
chapters  in  a  single  course.  The  following  sections  should  be  skipped  for  a  less 
technical  route: 


3.3, 4.3, 5.4, 6.4, 7.3, 8.4, 9.4, 10.4, 11.4, 12.2, 13.4, 14.3, 15.4, 16.4

It  has  been  mentioned  that  each  chapter  concludes  with  problems  and  solutions. 
Some  of  them  are  clearly  too  hard  or  lengthy  to  be  dealt  with  in  exams,  while  others 
are  questions  from  former  exams  of  my  own  or  are  representative  of  problems  to  be 
solved  in  my  exams. 

Frankfurt, Germany
July 2015
Uwe Hassler

Introduction

1.1 Summary


Stochastic calculus is used in finance and econom(etr)ics, for instance for solving
stochastic differential equations and handling stochastic integrals. This requires
stochastic processes. Although stemming from a rather recent area of mathematics,
the methods of stochastic calculus have quickly become widespread, and not only
in finance and economics. Moreover, these techniques - along with methods of time
series modeling - are central to the contemporary econometric toolbox. In this
introductory chapter some motivating questions are raised that will be answered in
the course of the book, thus providing a brief survey of the topics treated.


1.2  Finance 

The names of two Nobel prize winners dealing with finance are closely connected
to one field of applications treated in the textbook at hand.¹ The analysis and the
modeling of stock prices and returns is central to this work.

Stock Prices

Let $S(t)$, $t \ge 0$, be the continuous stock price of a stock with return $R(t) = S'(t)/S(t)$
expressed as growth rate. We assume constant returns,

$$R(t) = c \quad\Longleftrightarrow\quad \frac{dS(t)}{dt} = S'(t) = c\,S(t).$$

¹In 1997, R.C. Merton and M.S. Scholes were awarded the Nobel prize jointly, “for a new method
to determine the value of derivatives” (according to the official statement of the Nobel Committee).

This differential equation for the stock price is usually also written as follows:

$$dS(t) = c\,S(t)\,dt. \tag{1.1}$$

The corresponding solution is (see Problem 1.1)

$$S(t) = S(0)\,e^{ct}, \tag{1.2}$$

i.e. if $c > 0$ the exponential process is explosive. The assumption of a deterministic
stock price movement is of course unrealistic, which is why, since Black and Scholes
(1973) and Merton (1973), a stochastic differential equation consistent with (1.1) has
often been assumed,

$$dS(t) = c\,S(t)\,dt + \sigma\,S(t)\,dW(t), \tag{1.3}$$

where $dW(t)$ are the increments of a so-called Wiener process $W(t)$ (also referred to
as Brownian motion, cf. Chap. 7). This is a stochastic process, i.e. a random process.
Thus, for a fixed point in time $t$, $S(t)$ is a random variable. How does this random
variable behave on average? How do the parameters $c$ and $\sigma$ affect the expected
value and the variance as time passes? We will find answers to these questions in
Chap. 12 on stochastic differential equations.
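To get a first feeling for the question of how $S(t)$ behaves on average, one may simulate (1.3) numerically. The following is only a minimal sketch under stated assumptions: it uses the Euler-Maruyama discretization (a standard numerical scheme, not a method introduced in this chapter), the function names `simulate_stock_price` and `mean_terminal_price` are hypothetical, and the parameter values are purely illustrative.

```python
import math
import random


def simulate_stock_price(s0, c, sigma, t_end, n_steps, rng):
    """Euler-Maruyama path of dS = c*S dt + sigma*S dW; returns S(t_end)."""
    dt = t_end / n_steps
    s = s0
    for _ in range(n_steps):
        # Wiener increment over a step of length dt: normal with variance dt
        dw = rng.gauss(0.0, math.sqrt(dt))
        s += c * s * dt + sigma * s * dw
    return s


def mean_terminal_price(s0, c, sigma, t_end, n_steps, n_paths, seed=0):
    """Average of S(t_end) over independently simulated paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += simulate_stock_price(s0, c, sigma, t_end, n_steps, rng)
    return total / n_paths


if __name__ == "__main__":
    s0, c, sigma, t_end = 100.0, 0.05, 0.2, 1.0  # illustrative values only
    mc_mean = mean_terminal_price(s0, c, sigma, t_end, n_steps=50, n_paths=5000)
    print("simulated average of S(1):       ", mc_mean)
    print("deterministic solution (1.2):    ", s0 * math.exp(c * t_end))
```

In such an experiment the simulated average should lie close to the deterministic path (1.2), while individual paths scatter around it the more strongly the larger $\sigma$ is chosen.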


Interest  Rates 

Next, $r(t)$ denotes an interest rate for $t \ge 0$. Assume it is given by the differential
equation

$$dr(t) = c\,(r(t) - \mu)\,dt \tag{1.4}$$

with $c \in \mathbb{R}$, or equivalently by

$$r'(t) = \frac{dr(t)}{dt} = c\,(r(t) - \mu).$$

Expression (1.4) can alternatively be written as the following integral equation:

$$r(t) = r(0) + c \int_0^t (r(s) - \mu)\,ds. \tag{1.5}$$

The solution to this reads (see Problem 1.2)

$$r(t) = \mu + e^{ct}\,(r(0) - \mu). \tag{1.6}$$

For $c < 0$ it therefore holds that the interest rate converges to $\mu$ as time goes
by. Again, a deterministic movement is not realistic. This is why Vasicek (1977)
specified a stochastic differential equation consistent with (1.4):

$$dr(t) = c\,(r(t) - \mu)\,dt + \sigma\,dW(t). \tag{1.7}$$

As mentioned above, $dW(t)$ denotes the increments of a Wiener process. How is the
interest rate movement (on average) affected by the parameter $c$? Which kind of
stochastic process is described by (1.7)? The answers to these and similar questions
will be obtained in Chap. 13 on interest rate models.
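The role of the parameter $c$ in (1.7) can be explored by simulation before the theory is developed. The sketch below again assumes an Euler-Maruyama discretization, and the names `simulate_vasicek`, `deterministic_rate` and `mean_rate` as well as all parameter values are hypothetical choices made for illustration. It compares the average of simulated rates with the deterministic solution (1.6).

```python
import math
import random


def simulate_vasicek(r0, c, mu, sigma, t_end, n_steps, rng):
    """Euler-Maruyama path of dr = c*(r - mu) dt + sigma dW; returns r(t_end)."""
    dt = t_end / n_steps
    r = r0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment, variance dt
        r += c * (r - mu) * dt + sigma * dw
    return r


def deterministic_rate(r0, c, mu, t):
    """Solution (1.6) of the deterministic equation (1.4)."""
    return mu + math.exp(c * t) * (r0 - mu)


def mean_rate(r0, c, mu, sigma, t_end, n_steps, n_paths, seed=0):
    """Average of r(t_end) over independently simulated paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += simulate_vasicek(r0, c, mu, sigma, t_end, n_steps, rng)
    return total / n_paths


if __name__ == "__main__":
    r0, c, mu, sigma, t_end = 0.08, -1.0, 0.03, 0.02, 2.0  # illustrative values
    print("simulated average of r(2):", mean_rate(r0, c, mu, sigma, t_end, 200, 2000))
    print("deterministic r(2), (1.6):", deterministic_rate(r0, c, mu, t_end))
```

With $c < 0$ the simulated average is pulled from $r(0)$ towards $\mu$, in line with the deterministic case discussed above.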


Empirical  Returns 

Looking at return time series one can observe that the variance (or volatility)
fluctuates a lot as time passes. Long quiet market phases characterized by only
mild variation are followed by short periods characterized by extreme observations,
where extreme amplitudes again tend to entail extreme observations. Such
behavior is in conflict with the assumption of normally distributed data. It is an
empirically well-confirmed law (“stylized fact”) that financial market data in general
and returns in particular produce “outliers” with larger probability than would be
expected under normality.

It is crucial, however, that extreme observations occur in clusters (volatility
clusters). Even though returns are not correlated over time in efficient markets, they
are not independent, as there exists a systematic time dependence of volatility. Engle
(1982) suggested the so-called ARCH model (see Chap. 6) in order to capture the
outlined effects. His work founded an entire field of research known nowadays
under the keyword “financial econometrics”, and consequently he was awarded the
Nobel prize in 2003.²
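The statement that returns may be uncorrelated and yet dependent can be illustrated with simulated data. The model equations are not given at this point of the text; the sketch below assumes the standard ARCH(1) recursion $x_t = \sigma_t\,\varepsilon_t$ with $\sigma_t^2 = \alpha_0 + \alpha_1 x_{t-1}^2$ and standard normal $\varepsilon_t$. The function names and the parameter values are hypothetical.

```python
import math
import random


def simulate_arch1(n, alpha0, alpha1, seed=0):
    """Simulate x_t = sigma_t * eps_t with sigma_t^2 = alpha0 + alpha1 * x_{t-1}^2."""
    rng = random.Random(seed)
    x_prev = 0.0  # starting value, an arbitrary choice
    xs = []
    for _ in range(n):
        sigma2 = alpha0 + alpha1 * x_prev ** 2  # conditional variance
        x_prev = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        xs.append(x_prev)
    return xs


def lag1_autocorrelation(series):
    """Sample autocorrelation at lag 1."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


if __name__ == "__main__":
    returns = simulate_arch1(20000, alpha0=1.0, alpha1=0.3, seed=1)
    squared = [v * v for v in returns]
    print("lag-1 autocorrelation of returns:        ", lag1_autocorrelation(returns))
    print("lag-1 autocorrelation of squared returns:", lag1_autocorrelation(squared))
```

The returns themselves show an autocorrelation close to zero, whereas the squared returns are clearly positively autocorrelated, which is the numerical counterpart of volatility clusters.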


1.3 Econometrics

Clive  Granger  (1934-2009)  was  a  British  econometrician  who  created  the  concept 
of  cointegration  (Granger,  1981).  He  shared  the  Nobel  prize  “for  methods  of 
analyzing  economic  time  series  with  common  trends  (cointegration)”  (official 
statement  of  the  Nobel  Committee)  with  R.F.  Engle.  The  leading  example  of 
trending  time  series  he  considered  is  the  random  walk. 


²R.F. Engle shared the Nobel prize “for methods of analyzing economic time series with
time-varying volatility (ARCH)” (official statement of the Nobel Committee) with C.W.J. Granger.

Random Walks

In econometrics, we are often concerned with time series not fluctuating with
constant variance around a fixed level. A widely used class of models accounting for
this nonstationarity is that of so-called integrated processes. They form the basis for the
cointegration approach that has become an integral part of common econometric
methodology since Engle and Granger (1987). Let us consider a special case - the
random walk - as a preliminary model,

$$x_t = \sum_{j=1}^{t} \varepsilon_j, \quad t = 1, \ldots, n, \tag{1.8}$$

where $\{\varepsilon_t\}$ is a random process, i.e. $\varepsilon_t$ and $\varepsilon_s$, $t \neq s$, are uncorrelated or even
independent with zero expected value and constant variance $\sigma^2$. For a random walk
with zero starting value $x_0 = 0$ it holds by definition that:

$$x_t = x_{t-1} + \varepsilon_t, \quad t = 1, \ldots, n, \quad \text{with } \operatorname{Var}(x_t) = \sigma^2\,t. \tag{1.9}$$

The increments can also be written using the difference operator $\Delta$,

$$\Delta x_t = x_t - x_{t-1} = \varepsilon_t.$$
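The relations (1.8) and (1.9) are easily reproduced numerically. The following sketch is an illustration only: it assumes increments taking the values $+1$ and $-1$ with equal probability (so that $\sigma^2 = 1$), and the function names are hypothetical.

```python
import random


def random_walk(increments):
    """Cumulate increments into x_t = eps_1 + ... + eps_t (x_0 = 0 is not stored)."""
    path, level = [], 0
    for eps in increments:
        level += eps
        path.append(level)
    return path


def differences(path):
    """Apply the difference operator, using the starting value x_0 = 0."""
    previous = [0] + path[:-1]
    return [x - x_prev for x, x_prev in zip(path, previous)]


def sample_variance_at(t, n_paths, seed=0):
    """Monte Carlo estimate of Var(x_t) for +/-1 increments (sigma^2 = 1)."""
    rng = random.Random(seed)
    endpoints = []
    for _ in range(n_paths):
        eps = [rng.choice((-1, 1)) for _ in range(t)]
        endpoints.append(random_walk(eps)[-1])
    mean = sum(endpoints) / n_paths
    return sum((x - mean) ** 2 for x in endpoints) / n_paths


if __name__ == "__main__":
    eps = [1, -1, -1, 1, 1]
    x = random_walk(eps)
    print("path:       ", x)
    print("differences:", differences(x))  # recovers the increments
    print("estimated Var(x_50):", sample_variance_at(50, 4000))
```

Differencing the path returns the increments, and the estimated variance of $x_{50}$ is close to $\sigma^2 t = 50$: the variance grows linearly in time, which is the nonstationarity referred to above.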

When two stochastically independent random walks are regressed on each other, a
statistically significant relationship is typically identified, which is a statistical artefact
and therefore nonsense (see Chap. 15). Two random walks following a common trend,
however, are called cointegrated. In this case the regression of one on the other not only
yields a consistent estimate of the true relationship, but the estimator is even
“superconsistent” (cf. Chap. 16).


Dickey-Fuller  Distribution 

If one wants to test whether a given time series indeed follows a random walk, then
equation (1.9) suggests estimating the regression

$$x_t = \hat a\,x_{t-1} + \hat\varepsilon_t, \quad t = 1, \ldots, n.$$

From this, the (ordinary) least squares (LS) estimator under the null hypothesis (1.9),
i.e. under $a = 1$, is obtained as

$$\hat a = \frac{\sum_{t=1}^{n} x_t\,x_{t-1}}{\sum_{t=1}^{n} x_{t-1}^2} = 1 + \frac{\sum_{t=1}^{n} x_{t-1}\,\varepsilon_t}{\sum_{t=1}^{n} x_{t-1}^2}.$$

This constitutes the basic ingredient for the test by Dickey and Fuller (1979). Under
the null hypothesis of a random walk ($a = 1$) it holds asymptotically ($n \to \infty$)

$$n\,(\hat a - 1) \overset{d}{\to} \mathcal{DF}_a, \tag{1.10}$$

where $\overset{d}{\to}$ stands for convergence in distribution and $\mathcal{DF}_a$ denotes the so-called
Dickey-Fuller distribution. Corresponding modes of convergence will be explained
in Chap. 14. Since Phillips (1987) an elegant way of expressing the Dickey-Fuller
distribution by stochastic integrals is known (again, $W(t)$ denotes a Wiener process):

$$\mathcal{DF}_a = \frac{\int_0^1 W(t)\,dW(t)}{\int_0^1 W^2(t)\,dt}. \tag{1.11}$$

Note (and enjoy!) the formal correspondence of the sum of squares $\sum_{t=1}^{n} x_{t-1}^2$ in
the denominator of $\hat a - 1$ and the integral over the squared Wiener process in the
denominator of (1.11), $\int_0^1 W^2(t)\,dt$ (this is a Riemann integral, cf. Chap. 8). Just
as well the sum $\sum_{t=1}^{n} x_{t-1}\,\varepsilon_t = \sum_{t=1}^{n} x_{t-1}\,\Delta x_t$ resembles the so-called Ito integral
$\int_0^1 W(t)\,dW(t)$. But how are these integrals defined, what are they about? How is this
distribution (and similar ones) attained? And why does there exist another equivalent
representation,

$$\mathcal{DF}_a = \frac{W^2(1) - 1}{2 \int_0^1 W^2(t)\,dt}, \tag{1.12}$$

of the Dickey-Fuller distribution? We concern ourselves with these questions in
connection with Ito’s lemma in Chap. 11.
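The equivalence of (1.11) and (1.12) has an elementary counterpart in finite samples: since $x_t^2 - x_{t-1}^2 = 2\,x_{t-1}\varepsilon_t + \varepsilon_t^2$, summing over $t$ gives $\sum x_{t-1}\varepsilon_t = \frac{1}{2}\,(x_n^2 - \sum \varepsilon_t^2)$. The sketch below checks this identity and computes $n\,(\hat a - 1)$ for a simulated random walk. It is an illustration under assumptions: increments of $\pm 1$ are used so that all sums are exact integers, and the function names are hypothetical.

```python
import random


def dickey_fuller_pieces(eps):
    """Return (sum x_{t-1}*eps_t, sum x_{t-1}^2, x_n) for x_t = x_{t-1} + eps_t, x_0 = 0."""
    x_prev = 0
    cross, squares = 0, 0
    for e in eps:
        cross += x_prev * e       # numerator of a_hat - 1
        squares += x_prev ** 2    # denominator of a_hat - 1
        x_prev += e
    return cross, squares, x_prev


def df_statistic(eps):
    """n * (a_hat - 1) under the null a = 1; needs at least two observations."""
    cross, squares, _ = dickey_fuller_pieces(eps)
    return len(eps) * cross / squares


if __name__ == "__main__":
    rng = random.Random(7)
    eps = [rng.choice((-1, 1)) for _ in range(1000)]
    cross, squares, x_n = dickey_fuller_pieces(eps)
    # discrete analogue of (1.12): the numerator equals (x_n^2 - sum eps_t^2) / 2
    print("2 * sum x_{t-1} eps_t:", 2 * cross)
    print("x_n^2 - sum eps_t^2:  ", x_n ** 2 - sum(e * e for e in eps))
    print("n * (a_hat - 1):      ", df_statistic(eps))
```

Repeating the last computation for many independent samples yields an empirical approximation to the distribution in (1.10).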


Autocorrelation 

The assumption that the increments $\Delta x_t = x_t - x_{t-1}$ of economic time series are
free from serial (temporal) correlation - as is true for the random walk - is too
restrictive in practice. Thus, we have to learn how the Dickey-Fuller distribution is
generalized with autocorrelated (i.e. serially correlated) increments. In practice,
so-called ARMA models are used most frequently in order to model autocorrelation.
This class of models will be discussed intuitively as well as rigorously in Chap. 3.
The so-called spectral analysis translates autocorrelation patterns into oscillation
patterns. In Chap. 4 we learn to determine which frequencies’ or periods’ oscillations
contribute particularly strongly to the variation of a time series. Economists are often
deterred from spectral analysis by its extensive use of complex numbers.
Therefore, we suggest an approach that avoids complex numbers. Finally, Chap. 5
introduces a model where the temporal dependence is particularly persistent such
that the autocorrelations die out more slowly than in the ARMA case. Such a feature
has been called “long memory” and is observed with many economic and financial
series.


1.4  Mathematics 

Stochastic calculus, which will be applied here, is a rather recent area in mathematics.
It was pioneered by Kiyoshi Ito in a sequence of pathbreaking papers published
in Japanese starting in the 1940s.³ ⁴ The Ito integral as a special
case of stochastic integration is introduced in Chap. 10.


Ito  Integrals 

The aforementioned interest rate model by Vasicek (1977) leads to a stochastic
process given by an integral constructed as $\int_0^t f(s)\,dW(s)$, where $f$ is a deterministic
function and again $dW$ denotes the increments of a Wiener process. Such integrals -
being in a sense classical integrals - will be defined as Stieltjes integrals in Chap. 9.
Ito integrals are a generalization of these. At first glance, the deterministic function
$f$ is merely replaced by a stochastic process $X$, $\int_0^t X(s)\,dW(s)$. Mathematically, this results
in a considerably more complicated object, the definition of which is a problem
in its own right, cf. Chap. 10.


Ito's  Lemma 


At this point, the idea of Ito’s lemma is briefly conveyed. For the moment, assume
a deterministic (differentiable) function $f(t)$. Using the chain rule it holds for the
derivative of the square $f^2$:

$$\frac{df^2(t)}{dt} = 2\,f(t)\,f'(t),$$

or rather

$$\frac{df^2(t)}{2} = f(t)\,f'(t)\,dt = f(t)\,df(t). \tag{1.13}$$


³Alternative transcriptions of his name into the Latin alphabet, Itô or Itō, are frequently used in
the literature and are equally accepted. In this textbook we follow the spelling of Ito’s compatriot
(Tanaka, 1996).

⁴In 2006, Ito received the inaugural Gauss Prize for Applied Mathematics from the International
Mathematical Union, which has been awarded every four years since then.

Thus, for the ordinary integral it follows

$$\int_0^t f(s)\,df(s) = \int_0^t f(s)\,f'(s)\,ds = \frac{1}{2}\,f^2(s)\,\Big|_0^t = \frac{1}{2}\left(f^2(t) - f^2(0)\right).$$

However, among other things, we will learn that the Wiener process is not a
differentiable function with respect to time $t$. The ordinary chain rule does not apply,
and for the corresponding Ito integral one obtains

$$\int_0^t W(s)\,dW(s) = \frac{1}{2}\left(W^2(s) - s\right)\Big|_0^t = \frac{1}{2}\left(W^2(t) - W^2(0) - t\right). \tag{1.14}$$


This result follows from the famous and fundamental lemma by Ito, which in its
simplest case is a kind of “stochastified chain rule” for Wiener processes. Instead of (1.13),
for Wiener processes it holds that

$$\frac{dW^2(t)}{2} = W(t)\,dW(t) + \frac{1}{2}\,dt. \tag{1.15}$$
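The extra term $-t$ in (1.14) can be made visible numerically. The sketch below is an illustration only and anticipates the construction of Chap. 10 in a simplified way: it assumes that the Ito integral is approximated by a sum over an equidistant grid with the integrand evaluated at the left endpoint of each interval, and that $W(0) = 0$. The function name `ito_sum` is hypothetical.

```python
import math
import random


def ito_sum(n_steps, t_end=1.0, seed=0):
    """Left-endpoint sum of W(t_{i-1}) * (W(t_i) - W(t_{i-1})) on an equidistant grid.

    Returns the sum together with the terminal value W(t_end); W(0) = 0.
    """
    rng = random.Random(seed)
    dt = t_end / n_steps
    w, total = 0.0, 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment, variance dt
        total += w * dw                     # integrand taken at the left endpoint
        w += dw
    return total, w


if __name__ == "__main__":
    t_end = 1.0
    total, w_end = ito_sum(100000, t_end, seed=3)
    print("approximating sum:          ", total)
    print("(W^2(t) - t) / 2, cf. (1.14):", 0.5 * (w_end ** 2 - t_end))
    print("W^2(t) / 2, ordinary rule:   ", 0.5 * w_end ** 2)
```

For a fine grid the sum agrees closely with the right-hand side of (1.14), while the value suggested by the ordinary chain rule is off by roughly $t/2$.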

Substantial generalizations and multivariate extensions will be discussed in
Chap. 11. In particular, Ito’s lemma will enable us to solve stochastic differential
equations in Chap. 12, and it will turn out that $S(t)$ solving (1.3) is a so-called
geometric Brownian motion. In Chap. 13 we will look in greater detail at models
for interest rates as e.g. given by Eq. (1.7).

The starting point for all the considerations outlined is the Wiener process - often
also called Brownian motion. Before turning to it and its properties, general
stochastic processes need to be defined and classified. This is done -
among other things - in the following chapter on basic concepts from probability
theory.


1.5 Problems and Solutions

Problems 

1.1  Solve  the  differential  equation  (1.1),  i.e.  obtain  the  solution  (1.2). 

1.2  Verify  that  r(t)  from  (1.6)  solves  the  differential  equation  (1.4). 

1.3 Consider a simple regression model,

$$y_i = \alpha + \beta\,x_i + \varepsilon_i, \quad i = 1, \ldots, n,$$

with OLS estimator $\hat\beta$. Check that:

$$\hat\beta = \beta + \frac{\sum_{i=1}^{n} (x_i - \bar x)\,\varepsilon_i}{\sum_{i=1}^{n} (x_i - \bar x)^2},$$

with arithmetic mean $\bar x$.
Hint: $\sum_{i=1}^{n} (x_i - \bar x) = 0$.

1.4 Let $f^n$ denote the $n$-th power of a function $f$ with derivative $f'$, $n \in \mathbb{N}$. Show that:

$$f^n(t) = f^n(0) + n \int_0^t f^{n-1}(s)\,f'(s)\,ds.$$

Hint: Chain rule as in (1.13).


Solutions 


1.1 Using equation (1.1) we get by integration

$$\int_0^t \frac{dS(r)}{S(r)} = \int_0^t c\,dr = ct.$$

Since⁵

$$\frac{d\log(S(t))}{dt} = \frac{S'(t)}{S(t)},$$

this implies

$$\log(S(t)) - \log(S(0)) = ct,$$

or

$$S(t) = e^{\log(S(0)) + ct} = S(0)\,e^{ct},$$

which is the required solution.

⁵By “log” we denote the natural logarithm to the base e.

1.2 Taking the derivative of (1.6) yields:

$$\frac{dr(t)}{dt} = c\,e^{ct}\,(r(0) - \mu) = c\,(r(t) - \mu),$$

where again the given form of $r(t)$ was used. Multiplying purely symbolically
by $dt$ yields equation (1.4). Hence, the problem is already solved.
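The analytical verification can be complemented by a quick numerical check. The following sketch is an illustration only: it approximates the derivative of (1.6) by a central difference quotient and compares it with the right-hand side of (1.4). The function names, the step size $h$ and the parameter values are hypothetical choices.

```python
import math


def rate(t, r0, c, mu):
    """Candidate solution (1.6): r(t) = mu + exp(c*t) * (r(0) - mu)."""
    return mu + math.exp(c * t) * (r0 - mu)


def derivative_residual(t, r0, c, mu, h=1e-6):
    """Central-difference approximation of r'(t) minus c * (r(t) - mu)."""
    slope = (rate(t + h, r0, c, mu) - rate(t - h, r0, c, mu)) / (2.0 * h)
    return slope - c * (rate(t, r0, c, mu) - mu)


if __name__ == "__main__":
    r0, c, mu = 0.08, -0.5, 0.03  # illustrative values
    for t in (0.0, 0.5, 2.0):
        print(t, rate(t, r0, c, mu), derivative_residual(t, r0, c, mu))
```

The residuals vanish up to numerical error, and `rate(0.0, ...)` reproduces the starting value $r(0)$.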

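1.3 It is well known that the OLS estimator is given by “covariance divided by
variance of the regressor”, i.e. it holds that:

$$\hat\beta = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)^2}.$$

Because of $\sum_{i=1}^{n} (x_i - \bar x) = 0$ this simplifies to

$$\hat\beta = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)\,y_i}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)^2}.$$

Assuming the model to be correct and substituting $y_i = \alpha + \beta\,x_i + \varepsilon_i$, one obtains

$$\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)(\alpha + \beta\,x_i + \varepsilon_i)}{\sum_{i=1}^{n} (x_i - \bar x)^2}.$$

Again applying the argument $\sum_{i=1}^{n} (x_i - \bar x) = 0$ yields

$$\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)\left(\beta\,(x_i - \bar x) + \varepsilon_i\right)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \beta + \frac{\sum_{i=1}^{n} (x_i - \bar x)\,\varepsilon_i}{\sum_{i=1}^{n} (x_i - \bar x)^2}.$$

This was exactly the claim.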
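The identity can also be confirmed on a small artificial data set. In the sketch below the data, the true parameters and the function names are made up for illustration; the two functions compute the left-hand and the right-hand side of the claim, respectively.

```python
def ols_slope(x, y):
    """OLS slope: sample covariance of x and y divided by sample variance of x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


def slope_from_errors(x, eps, beta):
    """Right-hand side of the claim: beta plus the weighted sum of the errors."""
    n = len(x)
    x_bar = sum(x) / n
    num = sum((xi - x_bar) * ei for xi, ei in zip(x, eps))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return beta + num / den


if __name__ == "__main__":
    alpha, beta = 2.0, 3.0                  # hypothetical true parameters
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    eps = [0.5, -1.0, 0.25, 0.75, -0.5]     # hypothetical error terms
    y = [alpha + beta * xi + ei for xi, ei in zip(x, eps)]
    print(ols_slope(x, y), slope_from_errors(x, eps, beta))
```

Both numbers coincide up to rounding, whatever values are chosen for the errors.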

1.4 We address the problem in a slightly more general way. Let $g$ be a differentiable
function with derivative $g'$. By the fundamental theorem of calculus⁶ it holds that

$$\int_0^t g'(s)\,ds = g(t) - g(0),$$

or

$$g(t) = g(0) + \int_0^t g'(s)\,ds.$$

If $g$ describes a process over time, this last relation can be interpreted in the following
way: the value at time $t$ is made up of the starting value $g(0)$ plus the sum or
integral over all changes occurring between 0 and $t$. Now, choosing in particular
$g(t) = f^n(t)$ with

$$g'(t) = n\,f^{n-1}(t)\,f'(t),$$

we obtain the required result.

⁶For an introduction to calculus we recommend Trench (2013); this book is available electronically
for free as a textbook approved by the American Institute of Mathematics.

Time Series Modeling


Basic Concepts from Probability Theory

2.1 Summary

This  chapter  reviews  some  basic  material.  We  collect  some  elementary  concepts 
and  properties  in  connection  with  random  variables,  expected  values,  multivariate 
and  conditional  distributions.  Then  we  define  stochastic  processes,  both  discrete  and 
continuous  in  time,  and  discuss  some  fundamental  properties.  For  a  successful  study 
of  the  remainder  of  this  book,  the  reader  is  required  to  be  familiar  with  all  of  these 
principles. 


2.2  Random  Variables 

Stochastic  processes  are  defined  as  families  of  random  variables.  This  is  why 
related  concepts  will  be  recapitulated  to  facilitate  the  definition  of  random  variables. 
Measure-theoretic aspects, however, will not be touched upon.


Probability  Space 

We denote the set of possible outcomes of a random experiment by $\Omega$. Subsets
$A$, $A \subset \Omega$, are called events. Probabilities are assigned to these events. The
probability is a mapping

$$A \mapsto P(A) \in [0, 1], \quad A \subset \Omega,$$

which fulfills the axioms of probability,

• $P(A) \ge 0$,

• $P(\Omega) = 1$,

• $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$ for $A_i \cap A_j = \emptyset$ with $i \neq j$,

where $\{A_i\}$ may be a possibly infinite sequence of pairwise disjoint events. For a
well-defined mapping, we do not consider every possible event but only those
contained in $\sigma$-algebras.² A $\sigma$-algebra $\mathcal{F}$ of $\Omega$ is defined as a
system of subsets containing

• the empty set $\emptyset$,

• the complement $A^c$ of every subset $A \in \mathcal{F}$ (this is the set $\Omega$ without $A$, $A^c = \Omega \setminus A$),

• and the union $\bigcup_i A_i$ of a possibly infinite sequence of elements $A_i \in \mathcal{F}$.

Of course, a $\sigma$-algebra is not unique but can be constructed according to the problems
of interest. The interrelated triple of set of outcomes, $\sigma$-algebra and probability
measure, $(\Omega, \mathcal{F}, P)$, is also called a probability space.

¹Ross (2010) provides a nice introduction to probability, and so do Grimmett and Stirzaker (2001)
with a focus on stochastic processes. For a short reference and refresher, e.g. the shorter appendix
in Bickel and Doksum (2001) is recommended.

Example 2.1 (Game of Dice) Consider a fair six-sided die with the set of outcomes

$$\Omega = \{1, 2, 3, 4, 5, 6\},$$

where each elementary event $\{\omega\} \subset \Omega$ is assigned the same probability:

$$P(\{1\}) = \ldots = P(\{6\}) = \frac{1}{6}.$$

When $\#(A)$ denotes the number of elements of $A \subset \Omega$, it holds in the example of
the die that

$$P(A) = \frac{\#(A)}{\#(\Omega)} = \frac{\#(A)}{6}.$$

The probability for the occurrence of $A$ hence equals the number of outcomes
leading to $A$ divided by the number of possible outcomes. If one is only interested
in the event whether an even or an odd number occurs,

$$E = \{2, 4, 6\}, \quad E^c = \Omega \setminus E = \{1, 3, 5\},$$

then the $\sigma$-algebra obviously reads

$$\mathcal{F}_1 = \{\emptyset, E, E^c, \Omega\}.$$

If one is interested in all possible outcomes without any qualification, then the
$\sigma$-algebra chosen will be the power set of $\Omega$, $\mathcal{P}(\Omega)$. This is the set of all subsets
of $\Omega$:

$$\mathcal{F}_2 = \mathcal{P}(\Omega) = \{\emptyset, \{1\}, \ldots, \{6\}, \{1,2\}, \ldots, \{5,6\}, \{1,2,3\}, \ldots, \Omega\}.$$

Systematic counting shows that $\mathcal{P}(\Omega)$ contains exactly $2^{\#(\Omega)} = 2^6 = 64$ elements.
With one and the same probability mapping one obtains for different $\sigma$-algebras
different probability spaces:

$$(\Omega, \mathcal{F}_1, P) \quad \text{and} \quad (\Omega, \mathcal{F}_2, P). \quad \blacksquare$$

²Sometimes also called a $\sigma$-field, which motivates the symbol $\mathcal{F}$.
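For a finite set of outcomes the defining properties of a $\sigma$-algebra can be checked mechanically. The sketch below is an illustration for the die example only; the function names are hypothetical, and for finite $\Omega$ it is assumed sufficient to check unions of two sets, since every union of finitely many sets can be built up pairwise.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of omega, each represented as a frozenset."""
    items = sorted(omega)
    return {frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)}


def is_sigma_algebra(family, omega):
    """Check empty set, complements and unions for a family of subsets of a finite omega."""
    omega = frozenset(omega)
    family = set(family)
    if frozenset() not in family:
        return False
    if any(omega - a not in family for a in family):
        return False
    # pairwise unions suffice because omega is finite
    return all(a | b in family for a in family for b in family)


if __name__ == "__main__":
    omega = {1, 2, 3, 4, 5, 6}
    even = frozenset({2, 4, 6})
    f1 = {frozenset(), even, frozenset(omega) - even, frozenset(omega)}
    f2 = power_set(omega)
    print(is_sigma_algebra(f1, omega), is_sigma_algebra(f2, omega), len(f2))
    # dropping the complement of E destroys the sigma-algebra property
    print(is_sigma_algebra({frozenset(), even, frozenset(omega)}, omega))
```

Both $\mathcal{F}_1$ and the power set pass the check, and the power set indeed has 64 elements, whereas a family containing $E$ but not $E^c$ fails.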


Random  Variable 

Often not the events themselves are of interest but some values associated with them,
that is to say random variables. A real-valued one-dimensional random variable $X$
maps the set of outcomes $\Omega$ of the space $(\Omega, \mathcal{F}, P)$ to the real numbers:

$$X: \Omega \to \mathbb{R}, \quad \omega \mapsto X(\omega).$$

Again, however, not all such possible mappings can be considered. In particular, a
random variable is required to have the property of measurability (more precisely:
$\mathcal{F}$-measurability). This implies the following: A subset $B \subset \mathbb{R}$ defines an event of
$\Omega$ in such a way that:

$$X^{-1}(B) := \{\omega \in \Omega \mid X(\omega) \in B\}.$$

This so-called inverse image $X^{-1}(B) \subset \Omega$ of $B$ contains exactly those elements
of $\Omega$ which are mapped by $X$ to $B$. Let $\mathcal{B}$ be a family of sets consisting of subsets
of $\mathbb{R}$. Then measurability requires of a random variable $X$ that for all
$B \in \mathcal{B}$ the inverse images are contained in the $\sigma$-algebra $\mathcal{F}$: $X^{-1}(B) \in \mathcal{F}$. Thereby
the probability measure $P$ on $\mathcal{F}$ is conveyed to $\mathcal{B}$, i.e. the probability function $P_x$
assigning values to $X$ is induced as follows:

$$P_x(X \in B) = P\left(X^{-1}(B)\right), \quad B \in \mathcal{B}.$$

Thus, strictly speaking, $X$ does not map from $\Omega$ to $\mathbb{R}$ but from one probability space
to another:

$$X: (\Omega, \mathcal{F}, P) \to (\mathbb{R}, \mathcal{B}, P_x),$$

where $\mathcal{B}$ now denotes a $\sigma$-algebra named after Émile Borel. This Borel algebra $\mathcal{B}$ is
the smallest $\sigma$-algebra over $\mathbb{R}$ containing all real intervals. In particular, for $x \in \mathbb{R}$
the event $X \le x$ has an induced probability leading to the distribution function of
$X$ defined as follows:

$$F_x(x) := P_x(X \le x) = P_x\left(X \in (-\infty, x]\right) = P\left(X^{-1}((-\infty, x])\right), \quad x \in \mathbb{R}.$$

Example 2.2 (Game of Dice) Let us continue the example of dice and let us define
a random variable $X$ assigning a gain of 50 monetary units to an even number and
assigning a loss of 50 monetary units to an odd number,

$$X: \quad 1 \mapsto -50, \quad 2 \mapsto +50, \quad 3 \mapsto -50, \quad 4 \mapsto +50, \quad 5 \mapsto -50, \quad 6 \mapsto +50.$$

The random variable $X$ operates on the probability space $(\Omega, \mathcal{F}_1, P)$ known from
Example 2.1. For arbitrary real intervals probabilities $P_x$ with $\mathcal{F}_1 = \{\emptyset, E, E^c, \Omega\}$
are induced, e.g.:

$$P_x\left(X \in [-100, -50]\right) = P\left(X^{-1}([-100, -50])\right) = P(E^c) = \frac{1}{2},$$

$$F_x(60) = P_x\left(X \in (-\infty, 60]\right) = P\left(X^{-1}((-\infty, 60])\right) = P(\Omega) = 1.$$

Let a second random variable $Y$ model the following gain or loss function:

$$Y: \quad 1 \mapsto -10, \quad 2 \mapsto -20, \quad 3 \mapsto -30, \quad 4 \mapsto -40, \quad 5 \mapsto 0, \quad 6 \mapsto 100.$$

As in this case each outcome leads to another value of the random variable, the
probability space chosen is $(\Omega, \mathcal{F}_2, P)$ with the power set $\mathcal{F}_2 = \mathcal{P}(\Omega)$ being the
$\sigma$-algebra. Then we obtain for $Y$ for instance the following probabilities:

$$F_y(0) = P_y(Y \le 0) = P\left(Y^{-1}((-\infty, 0])\right) = P(\{1,2,3,4,5\}) = \frac{5}{6},$$

$$P_y\left(Y \in [-20, 20)\right) = P\left(Y^{-1}([-20, 20))\right) = P(\{1,2,5\}) = \frac{1}{2}.$$

For another probability space the mapping $Y$ is possibly not measurable and
therefore it cannot be a random variable. E.g. $Y$ is not $\mathcal{F}_1$-measurable. This is due
to the fact that the image $Y = 0$ has the inverse image $Y^{-1}(0) = \{5\} \subset \Omega$, which is
not contained in $\mathcal{F}_1$ as an elementary event: $\{5\} \notin \mathcal{F}_1$. ■
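The measurability argument of Example 2.2 can be reproduced by brute force. The sketch below assumes a fair die, so that each outcome has probability 1/6. The helper names `preimage`, `is_measurable` and `prob` are hypothetical. On a finite outcome set it is enough to inspect the inverse image of each attained value, since every other inverse image is a finite union of these.

```python
from fractions import Fraction

omega = frozenset({1, 2, 3, 4, 5, 6})
E = frozenset({2, 4, 6})
F1 = {frozenset(), E, omega - E, omega}

# Gain/loss mappings from Example 2.2
X = {1: -50, 2: 50, 3: -50, 4: 50, 5: -50, 6: 50}
Y = {1: -10, 2: -20, 3: -30, 4: -40, 5: 0, 6: 100}

def preimage(rv, predicate):
    """Inverse image: all outcomes whose value satisfies the predicate."""
    return frozenset(w for w in omega if predicate(rv[w]))

def is_measurable(rv, sigma_algebra):
    """Check that the inverse image of every attained value is an event."""
    return all(
        preimage(rv, lambda v, c=value: v == c) in sigma_algebra
        for value in set(rv.values())
    )

def prob(event):
    """Probability under the assumed fair die."""
    return Fraction(len(event), len(omega))

print(is_measurable(X, F1), is_measurable(Y, F1))
print(prob(preimage(X, lambda v: -100 <= v <= -50)))
print(prob(preimage(Y, lambda v: v <= 0)), prob(preimage(Y, lambda v: -20 <= v < 20)))
```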


Continuous  Random  Variables 

For most problems in practice we do not explicitly construct a random
experiment with probability $P$ in order to derive probabilities $P_x$ of a random
variable $X$. Typically we start directly with the quantity of interest $X$, modeling its
probability distribution without inducing it. In particular, this is the case for so-called
continuous variables. For a continuous random variable every value taken
from a real interval is a possible realization. As a continuous random variable can
therefore take uncountably many values, it is not possible to calculate a probability
$P(x_1 < X \le x_2)$ by summing up the individual probabilities. Instead, probabilities
are calculated by integrating a probability density. We assume the function $f(x)$ to
be continuous (or at least Riemann-integrable) and to be nonnegative for all $x \in \mathbb{R}$.
Then $f$ is called (probability) density (or density function) of $X$ if it holds for
arbitrary numbers $x_1 < x_2$ that

$$P(x_1 < X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx.$$

The area beneath the density function therefore measures the probability with
which the continuous random variable takes on values of the interval considered.
In general, a density is defined by two properties:

1. $f(x) \ge 0$,
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

Thus, the distribution function $F(x) = P(X \le x)$ of a continuous random variable $X$
is calculated as follows:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$

If there is a danger of confusion, we sometimes subscript the distribution
function, e.g. $F_x(0) = P(X \le 0)$.


Expected  Value  and  Higher  Moments 


As is well known, the expected value $\mathrm{E}(X)$ (also called expectation) of a
continuous random variable $X$ with continuous density $f$ is defined as follows:

$$\mathrm{E}(X) = \int_{-\infty}^{\infty} x f(x)\,dx.$$

For (measurable) mappings $g$, transformations $g(X)$ are again random variables, and
the expected value is given by:

$$\mathrm{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$

In particular, for each power of $X$ so-called moments are defined for $k = 1, 2, \ldots$:

$$\mu_k = \mathrm{E}\left[X^k\right].$$

Note that this term represents integrals which are not necessarily finite (then one
says: the respective moments do not exist). There are even random variables whose
density $f$ allows for very large observations in absolute value with such a high
probability that even the expected value $\mu_1$ is not finite.3 If nothing else is suggested,
we will always assume random variables with finite moments without pointing this out
explicitly.

Often we consider so-called centered moments where $g(X)$ is chosen as
$(X - \mathrm{E}(X))^k$. For $k = 2$ the variance is obtained (often denoted by $\sigma^2$):4

$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mathrm{E}(X))^2 f(x)\,dx.$$

Elementarily, the following additive decomposition is shown:

$$\mathrm{Var}(X) = \mathrm{E}(X^2) - (\mathrm{E}(X))^2 = \mu_2 - \mu_1^2. \qquad (2.1)$$

3 An example for this is the Cauchy distribution, i.e. the t-distribution with one degree of freedom.
For the Pareto distribution, as well, the existence of moments is dependent on the parameter value;
this is shown in Problem 2.2.

4 Then $\sigma$ denotes the positive square root of $\mathrm{Var}(X)$.
Since $\mathrm{Var}(X) \ge 0$ by construction, this gives rise to the following inequality:

$$(\mathrm{E}(X))^2 \le \mathrm{E}\left(X^2\right). \qquad (2.2)$$

In addition to centering, for higher moments a standardization is typically considered.
The following measures of skewness and kurtosis with $k = 3$ and $k = 4$,
respectively, are widely used:

$$\gamma_1 = \frac{\mathrm{E}\left[(X - \mu_1)^3\right]}{\sigma^3}, \qquad \gamma_2 = \frac{\mathrm{E}\left[(X - \mu_1)^4\right]}{\sigma^4}.$$

The skewness coefficient is used to measure deviations from symmetry. If $X$
exhibits a density $f$ which is symmetric around the expected value, it obviously
follows that $\gamma_1 = 0$. The interpretation of the kurtosis coefficient is more difficult.
Generally, $\gamma_2$ is taken as a measure for a distribution’s “peakedness”, or alternatively,
for how probable extreme observations (“outliers”) are. Frequently, the normal
distribution is taken as a reference. For every normal distribution (also called
Gaussian distribution, see Example 2.4) it holds that the kurtosis takes the value 3.
Furthermore, it can be shown that it always holds true that

$$\gamma_2 \ge 1,$$

which is verified in Problem 2.1.


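Example 2.3 (Kurtosis of a Continuous Uniform Distribution) The random variable
$X$ is assumed to be uniformly distributed on $[0, b]$ with density

$$f(x) = \begin{cases} \dfrac{1}{b}, & x \in [0, b] \\ 0, & \text{else.} \end{cases}$$

As is well known, it then holds that

$$\mu_1 = \mathrm{E}(X) = \frac{b}{2}, \qquad \sigma^2 = \mathrm{Var}(X) = \frac{b^2}{12}.$$

In order to calculate the kurtosis $\gamma_2$ we are interested in the fourth centered moment:

$$\mathrm{E}\left[(X - \mu_1)^4\right] = \int_0^b \left(x - \frac{b}{2}\right)^4 \frac{1}{b}\,dx.$$

For this we determine (binomial theorem):

$$\left(x - \frac{b}{2}\right)^4 = x^4 - 2x^3 b + \frac{3}{2} x^2 b^2 - \frac{1}{2} x b^3 + \frac{b^4}{16}.$$

From this it is obtained that

$$\int_0^b \left(x - \frac{b}{2}\right)^4 dx = \frac{b^5}{5} - \frac{b^5}{2} + \frac{b^5}{2} - \frac{b^5}{4} + \frac{b^5}{16} = \frac{b^5}{80},$$

and hence

$$\mathrm{E}\left[(X - \mu_1)^4\right] = \frac{b^4}{80}.$$

The kurtosis coefficient is therefore determined as

$$\gamma_2 = \frac{\mathrm{E}\left[(X - \mu_1)^4\right]}{\sigma^4} = \frac{b^4}{80} \left(\frac{12}{b^2}\right)^2 = 1.8.$$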


It  is  obvious  that  the  kurtosis  is  independent  of  b.  The  value  1.8  is  clearly  smaller 
than  3  indicating  that  the  uniform  distribution’s  curve  exhibits  a  flatter  behavior  than 
that  of  the  normal  distribution.  ■ 
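The calculation of Example 2.3 can be repeated in exact rational arithmetic. The following is a minimal sketch with hypothetical helper names. It uses the raw moments $\mathrm{E}(X^k) = b^k/(k+1)$ of the uniform distribution on $[0, b]$ and the binomial theorem for the centered moments.

```python
from fractions import Fraction
from math import comb

def uniform_raw_moment(k, b):
    """E[X^k] = b^k / (k + 1) for X uniform on [0, b]."""
    return Fraction(b) ** k / (k + 1)

def uniform_central_moment(k, b):
    """E[(X - mu)^k] via the binomial theorem applied to raw moments."""
    mu = uniform_raw_moment(1, b)
    return sum(
        comb(k, j) * uniform_raw_moment(j, b) * (-mu) ** (k - j)
        for j in range(k + 1)
    )

def uniform_kurtosis(b):
    """Fourth centered moment divided by the squared variance."""
    return uniform_central_moment(4, b) / uniform_central_moment(2, b) ** 2

for b in (1, 2, 10):
    print(b, uniform_central_moment(4, b), uniform_kurtosis(b))
```

For every choice of b the ratio equals 9/5, in line with the statement that the kurtosis does not depend on b.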


Markov's  and  Chebyshev's  Inequality 

Consider again the random variable $X$ with variance $\sigma^2 = \mathrm{Var}(X)$. Depending on
$\sigma^2$, Chebyshev’s inequality allows us to bound the probability with which the random
variable deviates from its expected value. In fact, this result is a special case
of the more general Markov’s inequality, see (2.3), which is established e.g. in Ross
(2010, Sect. 8.2). A proof of Chebyshev’s result given in (2.4) will be provided in
Problem 2.3.

Lemma 2.1 (Markov’s and Chebyshev’s Inequality) Let $X$ be a random variable.

(a) If $X$ takes only nonnegative values, then it holds for any real constant $a > 0$:

$$P(X \ge a) \le \frac{\mathrm{E}(X)}{a}; \qquad (2.3)$$

(b) with $\sigma^2 = \mathrm{Var}(X) < \infty$ it holds that

$$P(|X - \mathrm{E}(X)| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}, \qquad (2.4)$$

where $\varepsilon > 0$ is an arbitrary real constant.
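Both bounds of Lemma 2.1 can be compared with exact probabilities in a simple discrete case. The sketch below is an illustration only and not part of the lemma: it assumes a fair die (outcomes 1 to 6, each with probability 1/6), and the function names `markov` and `chebyshev` are hypothetical.

```python
from fractions import Fraction

outcomes = range(1, 7)      # fair die (assumption)
p = Fraction(1, 6)          # probability of each face
mean = sum(p * x for x in outcomes)
var = sum(p * (x - mean) ** 2 for x in outcomes)

def markov(a):
    """Return (exact P(X >= a), Markov bound E(X)/a)."""
    exact = sum(p for x in outcomes if x >= a)
    return exact, mean / a

def chebyshev(eps):
    """Return (exact P(|X - E(X)| >= eps), Chebyshev bound Var(X)/eps^2)."""
    exact = sum(p for x in outcomes if abs(x - mean) >= eps)
    return exact, var / Fraction(eps) ** 2

print(mean, var)
print(markov(5))
print(chebyshev(2))
```

In both cases the exact probability lies well below the bound, which illustrates that the inequalities hold for every admissible distribution at the price of being rather loose for a particular one.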

Example 2.4 (Normal Distribution) The density of a random variable $X$ with
normal or Gaussian distribution with parameters $\mu$ and $\sigma > 0$ goes back to Gauss5
and is, as is well known,

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R},$$

with

$$\mathrm{E}(X) = \mu \quad \text{and} \quad \mathrm{Var}(X) = \sigma^2.$$

In symbols we also write $X \sim \mathcal{N}(\mu, \sigma^2)$. As the density function is symmetric
around $\mu$ it follows that $\gamma_1 = 0$. The kurtosis we adopt from the literature without
calculation as $\gamma_2 = 3$. Sometimes we use this result for determining the fourth
centered moment. Under normality it holds that:

$$\mathrm{E}\left[(X - \mu)^4\right] = 3\,(\mathrm{Var}(X))^2.$$

We want to use this example to show that Chebyshev’s inequality may be not very
sharp. For example,

$$P(|X - \mu| \ge 2\sigma) \le \frac{\sigma^2}{4\sigma^2} = 0.25.$$

When using the standard normal distribution, however, one obtains a much smaller
probability than the bound due to (2.4):

$$P(|X - \mu| \ge 2\sigma) = P\left(\frac{|X - \mu|}{\sigma} \ge 2\right) = 2\,P\left(\frac{X - \mu}{\sigma} \le -2\right) \approx 0.0455.$$ ■

5 The traditional German spelling is Gauß. Carl Friedrich Gauß lived from 1777 to 1855 and was a
professor in Göttingen. His name is connected to many discoveries and inventions in theoretical and
applied mathematics. His portrait and a graph of the density of the normal distribution decorated
the 10-DM-bill in Germany prior to the Euro.

2.3 Joint and Conditional Distributions

In this section we first recapitulate some widely known results. At the end we
introduce the more involved theory of conditional expectation.


Joint  Distribution  and  Independence 


In order to restrict the notational burden, we only consider the three-dimensional
case of continuous random variables $X$, $Y$ and $Z$ with the joint density function
$f_{x,y,z}$ mapping from $\mathbb{R}^3$ to $\mathbb{R}$. For arbitrary real numbers $a$, $b$ and $c$, probabilities are
defined as multiple (or iterated) integrals:

$$P(X \le a, Y \le b, Z \le c) = \int_{-\infty}^{c} \int_{-\infty}^{b} \int_{-\infty}^{a} f_{x,y,z}(x,y,z)\,dx\,dy\,dz.$$

As long as $f$ is a continuous function, the order of integration does not matter, i.e.
one obtains e.g.

$$P(X \le a, Y \le b, Z \le c) = \int_{-\infty}^{a} \int_{-\infty}^{b} \int_{-\infty}^{c} f_{x,y,z}(x,y,z)\,dz\,dy\,dx = \int_{-\infty}^{b} \int_{-\infty}^{a} \int_{-\infty}^{c} f_{x,y,z}(x,y,z)\,dz\,dx\,dy.$$

This reversibility is sometimes called Fubini’s theorem.6

Univariate and bivariate marginal distributions arise from integrating out the respective
variables:

$$f_x(x) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{x,y,z}(x,y,z)\,dy\,dz,$$

$$f_{x,y}(x,y) = \int_{-\infty}^{\infty} f_{x,y,z}(x,y,z)\,dz.$$

6 Cf. Sydsæter, Strøm, and Berck (1999, p. 53). A proof is contained e.g. in the classical textbook
by Rudin (1976, Thm. 10.2), or in Trench (2013, Coro. 7.2.2); the latter book may be recommended
since it is downloadable free of charge.
The variables are called stochastically independent if, for arbitrary arguments, the
joint density is given as the product of the marginal densities:

$$f_{x,y,z}(x,y,z) = f_x(x)\,f_y(y)\,f_z(z),$$

which implies pairwise independence:

$$f_{x,y}(x,y) = f_x(x)\,f_y(y).$$

The joint probability

$$P(X \le a, Y \le b, Z \le c) = \int_{-\infty}^{c} \int_{-\infty}^{b} \int_{-\infty}^{a} f_x(x)\,f_y(y)\,f_z(z)\,dx\,dy\,dz$$

is, under independence, factorized to

$$\begin{aligned}
P(X \le a, Y \le b, Z \le c) &= \int_{-\infty}^{c} \int_{-\infty}^{b} f_y(y)\,f_z(z) \left[\int_{-\infty}^{a} f_x(x)\,dx\right] dy\,dz \\
&= \int_{-\infty}^{c} f_z(z) \left(\int_{-\infty}^{b} f_y(y)\,dy\right) \left[\int_{-\infty}^{a} f_x(x)\,dx\right] dz \\
&= \int_{-\infty}^{a} f_x(x)\,dx \int_{-\infty}^{b} f_y(y)\,dy \int_{-\infty}^{c} f_z(z)\,dz \\
&= P(X \le a)\,P(Y \le b)\,P(Z \le c).
\end{aligned}$$


Covariance 


In particular for only two variables a generalization of the expectation operator is
considered. Let $h$ be a real-valued function of two variables, $h: \mathbb{R}^2 \to \mathbb{R}$, then we
define as a double integral:

$$\mathrm{E}[h(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x,y)\,f_{x,y}(x,y)\,dx\,dy.$$

Hence, the covariance between $X$ and $Y$ can be defined as follows:

$$\mathrm{Cov}(X,Y) := \mathrm{E}[(X - \mathrm{E}(X))(Y - \mathrm{E}(Y))] = \mathrm{E}(XY) - \mathrm{E}(X)\,\mathrm{E}(Y),$$

where the finiteness of these integrals is again assumed tacitly. It can be easily
shown that the independence of two variables implies their uncorrelatedness, i.e.
$\mathrm{Cov}(X,Y) = 0$, whereas the reverse does not generally hold true. In particular, the
covariance only measures the linear relation between two variables. In order to have
the measure independent of the units, it is usually standardized as follows:

$$\rho_{xy} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}.$$

The correlation coefficient $\rho_{xy}$ is smaller than or equal to one in absolute value, see
Problem 2.7.
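That uncorrelatedness does not imply independence can be seen from a very small example. The sketch below is an assumption made purely for illustration: it uses a discrete variable $X$, uniform on $\{-1, 0, 1\}$, together with $Y = X^2$, whereas the text treats continuous variables. Sums then replace the integrals.

```python
from fractions import Fraction

# Hypothetical discrete example: X uniform on {-1, 0, 1}, Y = X^2
support = [-1, 0, 1]
p = Fraction(1, 3)
joint = {(x, x * x): p for x in support}   # P(X = x, Y = y)

def expect(h):
    """E[h(X, Y)] as a probability-weighted sum over the joint support."""
    return sum(prob * h(x, y) for (x, y), prob in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X = 0, Y = 0) = P(X = 0) P(Y = 0)
p_x0 = sum(prob for (x, y), prob in joint.items() if x == 0)
p_y0 = sum(prob for (x, y), prob in joint.items() if y == 0)
p_joint = joint.get((0, 0), Fraction(0))

print(cov)
print(p_joint, p_x0 * p_y0)
```

The covariance vanishes although $Y$ is a deterministic function of $X$. The dependence here is purely nonlinear, which the covariance cannot detect.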

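Example 2.5 (Bivariate Normal Distribution) Let $X$ and $Y$ be two Gaussian
random variables,

$$X \sim \mathcal{N}(\mu_x, \sigma_x^2), \qquad Y \sim \mathcal{N}(\mu_y, \sigma_y^2),$$

with correlation coefficient $\rho$. We talk about a bivariate normal distribution if the
joint density takes the following form:

$$f_{x,y}(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}\,\varphi_{x,y}(x,y)$$

with $\varphi_{x,y}(x,y)$ equal to

$$\exp\left\{-\frac{1}{2(1 - \rho^2)} \left[\left(\frac{x - \mu_x}{\sigma_x}\right)^2 - 2\rho\,\frac{x - \mu_x}{\sigma_x}\,\frac{y - \mu_y}{\sigma_y} + \left(\frac{y - \mu_y}{\sigma_y}\right)^2\right]\right\}.$$

Symbolically, we denote the vector as

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N}_2(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\mu}$ is a vector and $\boldsymbol{\Sigma}$ stands for a symmetric matrix:

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_x^2 & \mathrm{Cov}(X,Y) \\ \mathrm{Cov}(X,Y) & \sigma_y^2 \end{pmatrix}.$$

In general, the covariance matrix is defined as follows:

$$\boldsymbol{\Sigma} = \mathrm{E}\left[\begin{pmatrix} X - \mathrm{E}(X) \\ Y - \mathrm{E}(Y) \end{pmatrix} \left(X - \mathrm{E}(X),\; Y - \mathrm{E}(Y)\right)\right].$$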
Note that in the case of uncorrelatedness ($\rho = 0$) it holds that

$$f_{x,y}(x,y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left\{-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right\} \exp\left\{-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right\} = f_x(x)\,f_y(y).$$

The joint density function is then determined as the product of the individual
densities. Consequently, the random variables $X$ and $Y$ are independent. Therefore it
follows, in particular for the (joint) normal distribution, that uncorrelatedness is equivalent
to stochastic independence. Furthermore, bivariate Gaussian random variables have
the property that each linear combination is univariate normally distributed. More
precisely, it holds for $\boldsymbol{\lambda} \in \mathbb{R}^2$ with $\boldsymbol{\lambda}' = (\lambda_1, \lambda_2)$ that:7

$$\lambda_1 X + \lambda_2 Y \sim \mathcal{N}(\boldsymbol{\lambda}'\boldsymbol{\mu},\; \boldsymbol{\lambda}'\boldsymbol{\Sigma}\boldsymbol{\lambda}).$$

Interesting special cases are obtained with $\boldsymbol{\lambda}' = (1, 1)$ and $\boldsymbol{\lambda}' = (1, -1)$ for
sums and differences. Note that furthermore for multivariate normal distributions
necessarily all marginal distributions are normal (with $\boldsymbol{\lambda}' = (1, 0)$ and $\boldsymbol{\lambda}' = (0, 1)$).
The reverse does not hold. A bivariate example for Gaussian marginal distributions
without a joint normal distribution is given by Bickel and Doksum (2001, p. 533). ■
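The mean and the variance of a linear combination follow directly from the mean vector and the covariance matrix. The sketch below evaluates the two quadratic forms for sums and differences. The numerical values are hypothetical and chosen only for illustration (variances 4 and 9, covariance 1.5, hence a correlation of 0.25), and the function name is an assumption.

```python
def linear_combination_moments(lam, mu, sigma):
    """Mean lam'mu and variance lam'Sigma lam of lam_1 X + lam_2 Y."""
    mean = sum(l * m for l, m in zip(lam, mu))
    variance = sum(
        lam[i] * sigma[i][j] * lam[j] for i in range(2) for j in range(2)
    )
    return mean, variance

# Hypothetical parameters of a bivariate normal vector (X, Y)'
mu = (1.0, 2.0)
sigma = ((4.0, 1.5),
         (1.5, 9.0))

print(linear_combination_moments((1, 1), mu, sigma))    # sum X + Y
print(linear_combination_moments((1, -1), mu, sigma))   # difference X - Y
print(linear_combination_moments((1, 0), mu, sigma))    # marginal of X
```

The variance of the sum exceeds that of the difference by four times the covariance, as the expanded quadratic form shows.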


Cauchy-Schwarz  Inequality 

The inequality by Cauchy and Schwarz is the reason why $|\rho_{xy}| \le 1$ applies. The
following statement is verified in Problem 2.6.

Lemma 2.2 (Cauchy-Schwarz Inequality) For arbitrary random variables $Y$ and
$Z$ it holds that

$$|\mathrm{E}(YZ)| \le \sqrt{\mathrm{E}(Y^2)}\,\sqrt{\mathrm{E}(Z^2)}, \qquad (2.5)$$

where  finite  moments  are  assumed. 

We want to supplement the Cauchy-Schwarz inequality by an intermediate inequality,
see (2.8). For this purpose we recall the so-called triangle inequality for
two real numbers:

$$|a_1 + a_2| \le |a_1| + |a_2|.$$

Obviously, this can be generalized to:

$$\left|\sum_{i=1}^{n} a_i\right| \le \sum_{i=1}^{n} |a_i|.$$

If the sequence is absolutely summable, it is allowed to set $n = \infty$. This suggests
that an analogous inequality also applies for integrals. If the function $g$ is continuous,
this implies continuity of $|g|$ and one obtains:

$$\left|\int_{-\infty}^{\infty} g(x)\,dx\right| \le \int_{-\infty}^{\infty} |g(x)|\,dx.$$

This implies for the expected value of a random variable $X$:

$$|\mathrm{E}(X)| \le \mathrm{E}(|X|). \qquad (2.6)$$

This relation resembles (2.2); in fact, both relations are special cases of Jensen’s
inequality.8 A random variable is called integrable if $\mathrm{E}(|X|) < \infty$. Of course this
implies a finite expected value. For integrability a finite second moment is sufficient,
which follows from (2.5) with $Y = |X|$ and $Z = 1$:

$$\mathrm{E}(|X|) \le \sqrt{\mathrm{E}(X^2)}\,\sqrt{1^2}.$$

Now, if setting $X = YZ$ in (2.6), it follows that: $|\mathrm{E}(YZ)| \le \mathrm{E}(|Y|\,|Z|)$. This is the
bound added to (2.5):

$$|\mathrm{E}(YZ)| \le \mathrm{E}(|Y|\,|Z|) \le \sqrt{\mathrm{E}(Y^2)}\,\sqrt{\mathrm{E}(Z^2)}. \qquad (2.8)$$

The first inequality follows from (2.6). The second one will be verified in the
problem section.

7 Up to this point a superscript prime at a function has denoted its derivative. In the rare cases
in which we are concerned with matrices or vectors, the symbol will also be used to indicate
transposition. Bearing in mind the respective context, there should not occur any ambiguity.

8 The general statement is: for a convex function $g$ it holds

$$g(\mathrm{E}(X)) \le \mathrm{E}(g(X)); \qquad (2.7)$$

see e.g. Sydsæter et al. (1999, p. 181), while a proof is given e.g. in Davidson (1994, Ch. 9) or
Ross (2010, p. 409).




Conditional  Distributions 


Conditional distributions and densities, respectively, are defined as the ratio of the
joint density and the “conditioning density”, i.e. they are defined by the following
density functions (where positive denominators are assumed):

$$f_{x|y}(x) = \frac{f_{x,y}(x,y)}{f_y(y)}, \qquad f_{x|y,z}(x) = \frac{f_{x,y,z}(x,y,z)}{f_{y,z}(y,z)}, \qquad f_{x,y|z}(x,y) = \frac{f_{x,y,z}(x,y,z)}{f_z(z)}.$$

It should be clear that these conditional densities are in fact density functions. In
case of independence it holds by definition that the conditional and the unconditional
densities are equal, e.g.

$$f_{x|y}(x) = f_x(x).$$

This is very intuitive: In case of two independent random variables, one does not
have any influence on the probability with which the other takes on values.


Conditional  Expectation 

If the random variables $X$ and $Y$ are not independent and if the realization of $Y$ is
known, $Y = y$, then the expectation of $X$ will be affected:

$$\mathrm{E}(X \mid Y = y) = \int_{-\infty}^{\infty} x\,f_{x|y}(x)\,dx.$$

Analogously, we define the conditional expectation of a random variable $Z$,
$Z = h(X,Y)$, $h: \mathbb{R}^2 \to \mathbb{R}$, given $Y = y$ as:

$$\mathrm{E}(Z \mid Y = y) = \mathrm{E}(h(X,Y) \mid Y = y) = \int_{-\infty}^{\infty} h(x,y)\,f_{x|y}(x)\,dx.$$

In particular, for $h(X,Y) = X\,g(Y)$ with $g: \mathbb{R} \to \mathbb{R}$ one therefore obtains

$$\mathrm{E}(X\,g(Y) \mid Y = y) = g(y) \int_{-\infty}^{\infty} x\,f_{x|y}(x)\,dx = g(y)\,\mathrm{E}(X \mid Y = y).$$
Here, the marginal density of $X$ is replaced by the conditional density conditioned
on the value $Y = y$.

Technically, we can calculate the density conditioned on the random variable $Y$
instead of conditioned on a value $Y = y$:

$$f_{x|Y}(x) = \frac{f_{x,y}(x,Y)}{f_y(Y)}.$$

By $f_{x|Y}(x)$ a transformation of the random variable $Y$ and consequently a new random
variable is obtained. This is also true for the related conditional expectations:9

$$\mathrm{E}(X \mid Y) = \int_{-\infty}^{\infty} x\,f_{x|Y}(x)\,dx,$$

$$\mathrm{E}(h(X,Y) \mid Y) = \int_{-\infty}^{\infty} h(x,Y)\,f_{x|Y}(x)\,dx.$$

As these are random variables, it is perfectly reasonable to determine the
expected value of the conditional expectation. This calculation can be carried out
applying a rule called the “law of iterated expectations (LIE)” in the literature; it is
given in Proposition 2.1. In order to prevent confusion whether $X$ or $Y$ is integrated,
it is advisable to subscript the expectation operator accordingly:

$$\mathrm{E}_y[\mathrm{E}_x(X \mid Y)] = \int_{-\infty}^{\infty} [\mathrm{E}_x(X \mid y)]\,f_y(y)\,dy = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} x\,f_{x|y}(x)\,dx\right] f_y(y)\,dy.$$


Although $Y$ and $g(Y)$ are random variables, after conditioning on $Y$ they can be
treated as constants and in case of a multiplicative composition, they can be put in
front of the expected value when integration is with respect to $X$. This is the second
statement in the following proposition, also cf. Davidson (1994, Theorem 10.10).
The first statement will be derived in Problem 2.9.

Proposition 2.1 (Conditional Expectation) With the notation introduced above, it
holds that:

(a) $\mathrm{E}_y[\mathrm{E}_x(X \mid Y)] = \mathrm{E}_x(X)$,

(b) $\mathrm{E}_h(g(Y)\,X \mid Y) = g(Y)\,\mathrm{E}_x(X \mid Y)$ for $h(X,Y) = X\,g(Y)$.

Frequently, we formulate these statements in a shorter way,

$$\mathrm{E}[\mathrm{E}(X \mid Y)] = \mathrm{E}(X),$$

$$\mathrm{E}(g(Y)\,X \mid Y) = g(Y)\,\mathrm{E}(X \mid Y),$$

if there is no risk of misunderstanding.

9 This is not a really rigorous way of introducing expectations conditioned on random variables.
A mathematically correct exposition, however, requires measure theoretical arguments not being
available at this point; cf. for example Davidson (1994, Ch. 10), or Klebaner (2005, Ch. 2). More
generally, one may define expectations conditioned on a $\sigma$-algebra, $\mathrm{E}(X \mid \mathcal{G})$, where $\mathcal{G}$ could be the
$\sigma$-algebra generated by $Y$: $\mathcal{G} = \sigma(Y)$.
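Both statements of Proposition 2.1 can be checked on a small discrete distribution, where sums take the place of the integrals. The joint probabilities below are hypothetical and serve only as an illustration. The helper names and the function $g(y) = y + 2$ are assumptions as well.

```python
from fractions import Fraction

# Hypothetical joint probabilities P(X = x, Y = y)
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4),
}

def marginal_y(y):
    """P(Y = y), obtained by summing the joint probabilities over x."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_exp(h, y):
    """E(h(X, Y) | Y = y) using conditional probabilities P(X = x | Y = y)."""
    return sum(h(x, yy) * p for (x, yy), p in joint.items() if yy == y) / marginal_y(y)

y_values = sorted({y for _, y in joint})

# (a) law of iterated expectations: E[E(X | Y)] = E(X)
e_x = sum(x * p for (x, _), p in joint.items())
lie = sum(cond_exp(lambda x, y: x, y) * marginal_y(y) for y in y_values)

# (b) g(Y) can be pulled out of the conditional expectation
g = lambda y: y + 2
lhs = [cond_exp(lambda x, y: x * g(y), y) for y in y_values]
rhs = [g(y) * cond_exp(lambda x, y: x, y) for y in y_values]

print(e_x, lie)
print(lhs, rhs)
```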


2.4  Stochastic  Processes  (SP) 

In  this  section  stochastic  processes  are  defined  and  classified.  In  the  following 
chapters  we  will  be  confronted  with  concrete  types  of  stochastic  processes. 


Definition 

A univariate stochastic process (SP) is a family of (real-valued) random variables,
$\{X(t;\omega)\}_{t \in \mathbb{T}}$, for a given index set $\mathbb{T}$:

$$X: \mathbb{T} \times \Omega \to \mathbb{R}, \qquad (t;\omega) \mapsto X(t;\omega).$$

The subscript $t \in \mathbb{T}$ is always to be interpreted as “time”. At a fixed point in time $t_0$
the stochastic process is therefore simply a random variable,

$$X: \Omega \to \mathbb{R}, \qquad \omega \mapsto X(t_0;\omega).$$

A fixed $\omega_0$, however, results in a path, a trajectory or a realization of a process which
is also often referred to as time series,

$$X: \mathbb{T} \to \mathbb{R}, \qquad t \mapsto X(t;\omega_0).$$

In fact, a stochastic process is a rather complex object. In order to characterize it
mathematically, random vectors of arbitrary, finite length $n$ at arbitrary points in
time $t_1 < \cdots < t_n$ have to be considered:

$$X_n(t_i) := \left(X(t_1;\omega), \ldots, X(t_n;\omega)\right)', \quad t_1 < \cdots < t_n.$$
The multivariate distribution of such an arbitrary random vector characterizes
a stochastic process. In particular, certain minimal requirements for the
finite-dimensional distribution of $X_n(t_i)$ guarantee that a stochastic process exists at all.10

Depending on the countability or non-countability of the index set $\mathbb{T}$, discrete-time
and continuous-time SPs are distinguished. In the case of sequences of random
variables, we talk about discrete-time processes, where the index set consists of
integers, $\mathbb{T} \subseteq \mathbb{N}$ or $\mathbb{T} \subseteq \mathbb{Z}$. For discrete-time processes we agree upon lower case
letters as an abbreviation without explicitly denoting the dependence on $\omega$,

$$\{x_t\},\; t \in \mathbb{T} \quad \text{for} \quad \{X(t;\omega)\}_{t \in \mathbb{T}}.$$

For so-called continuous-time processes the index set $\mathbb{T}$ is a real interval,
$\mathbb{T} = [a, b] \subseteq \mathbb{R}$, frequently $\mathbb{T} = [0, T]$ or $\mathbb{T} = [0, 1]$; however, open intervals are also
admitted. For continuous-time processes we also suppress the dependence on $\omega$
notationally and write in a shorter way11

$$X(t),\; t \in \mathbb{T} \quad \text{for} \quad \{X(t;\omega)\}_{t \in \mathbb{T}}.$$


Stationary  and  Gaussian  Processes 

Consider again generally an arbitrary vector of the length $n$,

$$X_n(t_i) = \left(X(t_1;\omega), \ldots, X(t_n;\omega)\right)'.$$

If $X_n(t_i)$ is jointly normally distributed for all $n$ and $t_i$, then $X(t;\omega)$ is called a normal
process (also: Gaussian process). Furthermore, we talk about a strictly stationary
process if the distribution is invariant over time. More precisely, $X_n(t_i)$ follows the
same distribution as a vector which is shifted by $s$ units on the time axis,

$$X_n'(t_i + s) = \left(X(t_1 + s;\omega), \ldots, X(t_n + s;\omega)\right).$$


The distributional properties of a strictly stationary process do not depend on
the location on the time axis but only on how far the individual components
$X(t_i;\omega)$ are apart from each other temporally. Strict stationarity therefore implies
the expected value and the variance to be constant (assuming they are finite) and the
autocovariance for two points in time to depend only on the temporal interval:

1. $\mathrm{E}(X(t;\omega)) = \mu_x$ for $t \in \mathbb{T}$,

2. $\mathrm{Cov}(X(t;\omega), X(t+h;\omega)) = \gamma_x(h)$ for all $t, t+h \in \mathbb{T}$,

and therefore in particular

$$\mathrm{Var}(X(t;\omega)) = \gamma_x(0) \quad \text{for all } t \in \mathbb{T}.$$

A process (with finite second moments $\mathrm{E}[(X(t;\omega))^2]$) fulfilling these two conditions
(without necessarily being strictly stationary) is also called weakly stationary
(or: second-order stationary). Under stationarity, we define the autocorrelation
coefficient, which is also independent of $t$:

$$\rho_x(h) = \frac{\gamma_x(h)}{\gamma_x(0)}.$$

10 These “consistency” requirements due to Kolmogorov are found e.g. in Brockwell and Davis
(1991, p. 11) or Grimmett and Stirzaker (2001, p. 372). A proof of Kolmogorov’s existence theorem
can be found e.g. in Billingsley (1986, Sect. 36).

11 The convention of using upper case letters for continuous-time processes is not universal.


Synonymously  to  autocorrelation  we  also  speak  of  serial  or  temporal  correlation. 
For  weak  stationarity  not  necessarily  the  whole  distribution  is  invariant  over  time, 
however,  at  least  the  expected  value  and  the  autocorrelation  structure  are  constant. 

In  the  following,  the  term  “stationarity”  always  refers  to  the  weak  form  unless 
stated  otherwise. 


Example 2.6 (White Noise Process) In the following chapters, $\{\varepsilon_t\}$ often denotes
a discrete-time process $\{\varepsilon(t;\omega)\}$ free of serial correlation. In addition we assume a
mean of zero and a constant variance $\sigma^2 > 0$, i.e.

$$\mathrm{E}(\varepsilon_t) = 0 \quad \text{and} \quad \mathrm{E}(\varepsilon_t \varepsilon_s) = \begin{cases} \sigma^2, & t = s \\ 0, & t \ne s. \end{cases}$$

By definition such a process is weakly stationary. We typically denote it as

$$\{\varepsilon_t\} \sim WN(0, \sigma^2).$$

The reason why such a process is called white noise will be provided in Chap. 4. ■

Example 2.7 (Pure Random Process) Sometimes $\{\varepsilon_t\}$ from Example 2.6 will
meet the stronger requirements of being identically and independently distributed.
Identically distributed implies that the marginal distribution

$$F_{t_i}(\varepsilon) = P(\varepsilon_{t_i} \le \varepsilon) = F(\varepsilon), \quad i = 1, \ldots, n,$$

does not vary over time. Independence means that the joint distribution of the vector

$$\varepsilon_{n,t_i}' = (\varepsilon_{t_1}, \ldots, \varepsilon_{t_n})$$

equals the product of the marginal distributions. As the marginal distributions are
invariant with respect to time, this also holds for their product. Thus, $\{\varepsilon_t\}$ is strictly
stationary. In the following it is furthermore assumed that $\varepsilon_t$ has zero expectation
and the finite variance $\sigma^2$. Symbolically, we also write:12

$$\{\varepsilon_t\} \sim iid(0, \sigma^2).$$

A stochastic process with these properties is frequently called a pure random
process. Clearly, an iid (or pure random) process is white noise. ■
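The defining properties of white noise can be inspected on simulated data. The following sketch draws a Gaussian iid sequence, which is only one possible pure random process, and computes sample autocovariances. The estimator with divisor $n$, the seed, the sample size and $\sigma = 2$ are assumptions made for illustration.

```python
import random

def sample_autocovariance(x, h):
    """Sample autocovariance at lag h (divisor n, a common convention)."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + h] - mean) for t in range(n - h)) / n

random.seed(0)                      # fixed seed for reproducibility
sigma = 2.0                         # hypothetical standard deviation
eps = [random.gauss(0.0, sigma) for _ in range(20000)]

gamma0 = sample_autocovariance(eps, 0)
rho = [sample_autocovariance(eps, h) / gamma0 for h in (1, 2, 3)]

print(sum(eps) / len(eps))          # sample mean, close to 0
print(gamma0)                       # sample variance, close to sigma**2
print(rho)                          # sample autocorrelations, close to 0
```

The sample autocorrelations fluctuate around zero with a standard deviation of roughly $1/\sqrt{n}$, so small nonzero values are to be expected.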


Markov  Processes  and  Martingales 

An SP is called a Markov process if all information of the past about its future
behavior is entirely concentrated in the present. In order to capture this concept
more rigorously, the set of information about the past of the process available up to
time $t$ is denoted by $\mathcal{I}_t$. Frequently, the information set is also referred to as

$$\mathcal{I}_t = \sigma\left(X(r;\omega),\; r \le t\right),$$

because it is the smallest $\sigma$-algebra generated by the past and present of the process
$X(r;\omega)$ up to time $t$.13 The entire information about the process up to time $t$ is
contained in $\mathcal{I}_t$. A Markov process, so to speak, does not remember how it arrived
at the present state: The probability that the process takes on a certain value at time
$t + s$ depends only on the value at time $t$ (“present”) and does not depend on the past
behavior. In terms of conditional probabilities, for $s > 0$ the corresponding property
reads:

$$P\left(X(t+s;\omega) \le x \mid \mathcal{I}_t\right) = P\left(X(t+s;\omega) \le x \mid X(t;\omega)\right). \qquad (2.9)$$


A process is called a martingale if the present value is the best prediction for
the future. A martingale technically fulfills two properties. In the first place, it has
to be (absolutely) integrable, i.e. it is required that (2.10) holds. Secondly, given all
information $\mathcal{I}_t$, the conditional expectation only uses the information at time $t$. More
precisely, the expected value for the future is equal to today’s value. Technically, this
amounts to:

$$\mathrm{E}\left(|X(t;\omega)|\right) < \infty, \qquad (2.10)$$

$$\mathrm{E}\left(X(t+s;\omega) \mid \mathcal{I}_t\right) = X(t;\omega), \quad s > 0. \qquad (2.11)$$

Note that the conditional expectation is a random variable. Therefore, strictly
speaking, equation (2.11) only holds with probability one.

12 The acronym stands for “independently identically distributed”.

13 By assumption, the information at an earlier point in time is contained in the information set at a
subsequent point in time: $\mathcal{I}_t \subseteq \mathcal{I}_{t+s}$ for $s > 0$. A family of such nested $\sigma$-algebras is also called
“filtration”.
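The martingale property (2.11) can be verified by enumeration for a simple discrete-time process. The sketch below assumes a random walk built from equally likely steps, which is a hypothetical illustration and not a process defined at this point of the text. It checks the one-step case $s = 1$ for every possible history of a given length. Integrability holds trivially because the walk is bounded after finitely many steps.

```python
from fractions import Fraction
from itertools import product

def conditional_mean_next(history, steps=(-1, 1)):
    """E(X_{t+1} | history) for a random walk X_t = sum of equally likely steps."""
    current = sum(history)                    # present value X_t
    weight = Fraction(1, len(steps))          # probability of each step
    return sum(weight * (current + s) for s in steps)

n = 4
# Symmetric steps: the conditional mean equals the present value for every history
is_martingale = all(
    conditional_mean_next(path) == sum(path)
    for path in product((-1, 1), repeat=n)
)

# Asymmetric steps (-1, 2): a drift of 1/2 per period violates the property
has_drift = conditional_mean_next((1,), steps=(-1, 2)) - 1

print(is_martingale, has_drift)
```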


Martingale  Differences 

Now, let us focus on the discrete-time case. A discrete-time martingale is defined by
the expectation at time t for t + 1 being given by the value at time t. This is equivalent
to expecting a zero increment from t to t + 1. Therefore, this concept is frequently
expressed in the form of differences. We then talk about martingale differences. As
we will see, such a property lies, in a sense, between uncorrelatedness and
independence and is interesting from both an economic and a statistical point of
view.

We again assume an integrable process, i.e. $\{x_t\}$ fulfills (2.10). It is called
a martingale difference (or martingale difference sequence) if the conditional
expectation (given its own past) is zero:

$$E\big(x_{t+1} \mid \sigma(x_t, x_{t-1}, \ldots)\big) = 0.$$

This condition states concretely that the past does not have any influence on
predictions (conditional expectation); i.e. knowing the past does not lead to an
improvement of the prediction: the forecast is always zero. Not surprisingly, this
also applies if only one single past observation is known (see Proposition 2.2(a)).
Two further conclusions for unconditional moments contained in the proposition
can be verified,14 see Problem 2.10: martingale differences are zero on average and
free of serial correlation. In spite of serial uncorrelatedness, martingale differences
are in general by no means independent over time. What is more, they do not even
have to be stationary, as it is not ruled out that their variance depends on t.


14 We cannot prove the first statement rigorously, which would require a generalization of
Proposition 2.1(a). The more general statement, taken e.g. from Breiman (1992, Prop. 4.20) or
Davidson (1994, Thm. 10.26), reads in our setting as

$$E\big[E(x_t \mid \mathcal{I}_{t-1}) \mid x_{t-h}\big] = E(x_t \mid x_{t-h}).$$

Proposition 2.2 (Martingale Differences) For a martingale difference sequence
$\{x_t\}$ with $\mathcal{I}_t = \sigma(x_s, s \le t)$ it holds that

(a) $E(x_t \mid x_{t-h}) = 0$ for $h > 0$,

(b) $E(x_t) = 0$,

(c) $\mathrm{Cov}(x_t, x_{t+h}) = E(x_t x_{t+h}) = 0$ for $h \ne 0$,

for all $t \in T$.

Note  that  a  stationary  martingale  difference  sequence  has  a  constant  variance  and 
is  thus  white  noise  by  Proposition  2.2.  The  concept  should  be  further  clarified  by 
means  of  an  example. 

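Example 2.8 (Martingale Difference) Consider the process given by

$$x_t = x_{t-1}\,\frac{\varepsilon_t}{\varepsilon_{t-2}}, \quad t \in \{2,\ldots,n\}, \quad \{\varepsilon_t\} \sim \mathrm{iid}(0,\sigma^2),$$

with $x_1 = \varepsilon_1$ and $\varepsilon_0 = 1$. From this it follows that $x_2 = x_1\,\varepsilon_2/\varepsilon_0 = \varepsilon_1 \varepsilon_2$ and by
continued substitution:

$$x_t = \varepsilon_{t-1}\,\varepsilon_t, \quad t = 2,\ldots,n.$$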

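We want to show that this is a martingale difference sequence. To this end, we note
that the past of the pure random process can be reconstructed from the past of $x_t$:

$$\varepsilon_2 = \frac{x_2}{\varepsilon_1} = \frac{x_2}{x_1}, \quad \varepsilon_3 = \frac{x_3}{\varepsilon_2}, \quad \ldots, \quad \varepsilon_t = \frac{x_t}{\varepsilon_{t-1}}.$$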


Therefore, the information set $\mathcal{I}_t$ constructed from $\{x_t, \ldots, x_1\}$ contains not only the
past values of $x_{t+1}$, but also the ones of the iid process up to time t. Thus, it holds
that

$$\begin{aligned}
E(x_{t+1} \mid \mathcal{I}_t) &= E(\varepsilon_t \varepsilon_{t+1} \mid \mathcal{I}_t) \\
&= \varepsilon_t\, E(\varepsilon_{t+1} \mid \mathcal{I}_t) \\
&= \varepsilon_t\, E(\varepsilon_{t+1}) \\
&= 0.
\end{aligned}$$

The first equality follows from the definition of the process. The second equality
is accounted for by Proposition 2.1(b). The third step is due to the independence
of $\varepsilon_{t+1}$ of the past up to t, which is why conditional and unconditional expectation
coincide. Finally, by assumption, $\varepsilon_{t+1}$ is zero on average. All in all, this establishes
the martingale difference property. Therefore, $\{x_t\}$ is free of serial
correlation; however, it is serially (i.e. temporally) dependent, which is obvious from
the recursive definition. ■
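The distinction between serial uncorrelatedness and serial dependence in Example 2.8 can be illustrated by simulation. The following is a minimal sketch, assuming Gaussian innovations (the example itself only requires iid innovations with mean zero); the function names, the sample size and the seed are illustrative choices and not part of the text.

```python
import numpy as np

def simulate_product_process(n, sigma=1.0, seed=0):
    """Simulate x_1 = eps_1 and x_t = eps_{t-1} * eps_t for t = 2, ..., n."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n)   # eps_1, ..., eps_n (Gaussian is an assumption)
    x = np.empty(n)
    x[0] = eps[0]                          # x_1 = eps_1
    x[1:] = eps[:-1] * eps[1:]             # x_t = eps_{t-1} * eps_t
    return x

def lag1_corr(z):
    """Sample correlation between z_t and z_{t+1}."""
    return np.corrcoef(z[:-1], z[1:])[0, 1]

x = simulate_product_process(200_000)
print("lag-1 correlation of x_t:  ", round(lag1_corr(x), 3))
print("lag-1 correlation of x_t^2:", round(lag1_corr(x**2), 3))
```

The levels should be close to uncorrelated, whereas the squares are clearly correlated because consecutive values share one innovation. For Gaussian innovations the population lag-1 correlation of the squares equals 0.25.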

A  prominent  class  of  martingale  differences  are  the  ARCH  processes  treated  in 
Chap.  6. 


2.5  Problems  and  Solutions 


Problems 


2.1 Prove for the kurtosis coefficient: $\gamma_2 \ge 1$.


2.2 Let X follow a Pareto distribution with


$$f(x) = \theta\, x^{-\theta-1}, \quad x \ge 1, \quad \theta > 0.$$


Prove that X has finite k-th moments if and only if $\theta > k$.

2.3  Prove  Chebyshev’s  inequality  (2.4). 


2.4 Consider a bivariate distribution with:

$$f_{x,y}(x,y) = \begin{cases} \dfrac{1}{ab}, & (x,y) \in [0,a] \times [0,b] \\ 0, & \text{else} \end{cases}$$

Prove that X and Y are stochastically independent.

2.5  Calculate  the  expected  values,  variances  and  the  correlation  of  X  and  Y  from 
Example  2.2. 

2.6  Prove  the  second  inequality  from  (2.8). 

2.7 Prove for the correlation coefficient that $|\rho_{xy}| \le 1$.

2.8 Consider a bivariate logistic distribution function for X and Y:

$$F_{x,y}(x,y) = \left(1 + e^{-x} + e^{-y}\right)^{-1},$$

where x and y from $\mathbb{R}$ are arbitrary. What does the conditional density function of X
given Y = y look like?

2.9 Prove statement (a) from Proposition 2.1.

2.10 Derive the properties (b) and (c) from Proposition 2.2.
Hint: Use statement (a).


Solutions 

2.1 Assuming finite fourth moments we define for a random variable X with $\mu_1 = E(X)$:

$$\gamma_2 = \frac{E(X - \mu_1)^4}{\sigma^4}.$$

Consider the standardized random variable Z with expectation 0 and variance 1:

$$Z = \frac{X - \mu_1}{\sigma} \quad \text{with} \quad E(Z^2) = 1.$$

For this random variable, it holds that $\gamma_2 = E(Z^4)$. Replacing X by $Z^2$ in (2.2), it
follows

$$1 = \big(E(Z^2)\big)^2 \le E(Z^4) = \gamma_2,$$

which proves the claim.

2.2 For the k-th moment it holds:

$$\int_{-\infty}^{\infty} x^k f(x)\,dx = \int_1^{\infty} \theta\, x^{k-\theta-1}\,dx.$$

1. case: If $\theta \ne k$, then the antiderivative results in

$$\int \theta\, x^{k-\theta-1}\,dx = \frac{\theta}{k-\theta}\, x^{k-\theta}.$$

The corresponding improper integral is defined as limit:

$$\int_1^{\infty} \theta\, x^{k-\theta-1}\,dx = \lim_{M \to \infty} \left[\frac{\theta}{k-\theta}\, x^{k-\theta}\right]_1^M.$$

For $\theta > k$ it follows that

$$\int_1^{\infty} \theta\, x^{k-\theta-1}\,dx = 0 - \frac{\theta}{k-\theta} = \frac{\theta}{\theta-k} < \infty.$$

For $\theta < k$, however, no finite value is obtained as $M^{k-\theta}$ goes off to infinity.

2. case: For $\theta = k$ the antiderivative takes on another form:

$$\int \theta\, x^{k-\theta-1}\,dx = \int \theta\, x^{-1}\,dx = \theta \log(x).$$

As the logarithm is unbounded, or $\log(M) \to \infty$ for $M \to \infty$, one cannot obtain
a finite expectation either, as the upper bound of integration is $\infty$.

Both  cases  jointly  prove  the  claim. 

2.3  We  provide  two  proofs.  The  first  one  builds  on  the  fact  that  (2.4)  is  a  special 
case  of  (2.3).  The  second  one  is  less  abstract  and  more  elementary,  and  hence 
instructive,  too. 

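1. Note that $(X - \mu)^2$ is a nonnegative random variable. Therefore, (2.3) applies
with $a = \varepsilon^2$:

$$P\big((X - \mu)^2 \ge \varepsilon^2\big) \le \frac{E\big((X - \mu)^2\big)}{\varepsilon^2}.$$

The event $(X - \mu)^2 \ge \varepsilon^2$, however, is equivalent to $|X - \mu| \ge \varepsilon$, which
establishes (2.4).

2. Elementarily, we prove the claim for the case that X is a continuous random variable
with density function f; the discrete case can be accomplished analogously.
Note the following sequence of inequalities:

$$\begin{aligned}
\mathrm{Var}(X) &= \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \\
&= \int_{-\infty}^{\mu-\varepsilon} (x - \mu)^2 f(x)\,dx + \int_{\mu-\varepsilon}^{\mu+\varepsilon} (x - \mu)^2 f(x)\,dx + \int_{\mu+\varepsilon}^{\infty} (x - \mu)^2 f(x)\,dx \\
&\ge \int_{-\infty}^{\mu-\varepsilon} (x - \mu)^2 f(x)\,dx + \int_{\mu+\varepsilon}^{\infty} (x - \mu)^2 f(x)\,dx \\
&\ge \int_{-\infty}^{\mu-\varepsilon} \varepsilon^2 f(x)\,dx + \int_{\mu+\varepsilon}^{\infty} \varepsilon^2 f(x)\,dx.
\end{aligned}$$

The first inequality is of course due to the omission of

$$\int_{\mu-\varepsilon}^{\mu+\varepsilon} (x - \mu)^2 f(x)\,dx \ge 0.$$

The second one is accounted for by the fact that for the integrands of the
respective integrals it holds that:

$$x - \mu \le -\varepsilon \quad \text{for} \quad x \le \mu - \varepsilon$$

and

$$x - \mu \ge \varepsilon \quad \text{for} \quad x \ge \mu + \varepsilon.$$

Up to this point, it is therefore shown that:

$$\begin{aligned}
\mathrm{Var}(X) &\ge \varepsilon^2 P(X \le \mu - \varepsilon) + \varepsilon^2 P(X \ge \mu + \varepsilon) \\
&= \varepsilon^2 P(|X - \mu| \ge \varepsilon).
\end{aligned}$$

This is equivalent to the claim.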

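2.4 The marginal density is obtained as follows:

$$\begin{aligned}
f_x(x) &= \int_{-\infty}^{\infty} f_{x,y}(x,y)\,dy \\
&= \int_0^b \frac{1}{ab}\,dy \\
&= \frac{b - 0}{ab} \\
&= \frac{1}{a} \quad \text{for } x \in [0,a],
\end{aligned}$$

and $f_x(x) = 0$ for $x \notin [0,a]$. It also holds that

$$f_y(y) = \begin{cases} \dfrac{1}{b}, & y \in [0,b] \\ 0, & \text{else} \end{cases}$$

Hence, one immediately obtains for all x and y:

$$f_{x,y}(x,y) = f_x(x)\, f_y(y),$$

which was to be proved.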

2.5 Obviously, the expected value of X is zero,

$$E(X) = 50 \cdot P_x(X = 50) - 50 \cdot P_x(X = -50) = 0.$$

Therefore, it holds for the variance that:

$$\begin{aligned}
\mathrm{Var}(X) &= E\big[(X - E(X))^2\big] = E(X^2) \\
&= 50^2 \cdot P_x(X = 50) + (-50)^2 \cdot P_x(X = -50) \\
&= \frac{2500}{2} + \frac{2500}{2} \\
&= 2500.
\end{aligned}$$

Also Y is zero on average:

$$\begin{aligned}
E(Y) &= -10 \cdot P_y(Y = -10) - 20 \cdot P_y(Y = -20) - 30 \cdot P_y(Y = -30) \\
&\quad - 40 \cdot P_y(Y = -40) + 0 \cdot P_y(Y = 0) + 100 \cdot P_y(Y = 100) \\
&= \frac{1}{6}\,(-10 - 20 - 30 - 40 + 100) = 0.
\end{aligned}$$

Hence, the variance reads

$$\mathrm{Var}(Y) = \frac{1}{6}\left((-10)^2 + (-20)^2 + (-30)^2 + (-40)^2 + 0^2 + 100^2\right) = 2166.67.$$

For the covariance we obtain

$$\begin{aligned}
\mathrm{Cov}(X,Y) &= E\big[(X - E(X))\,(Y - E(Y))\big] \\
&= E(XY) \\
&= \sum_{i=1}^{2} \sum_{j=1}^{6} x_i\, y_j\, P_{x,y}(X = x_i, Y = y_j).
\end{aligned}$$

In order to compute it, the entire joint probability distribution is to be established:

$$\begin{aligned}
P_{x,y}(X = -50, Y = -40) &= P(\{1,3,5\} \cap \{4\}) = P(\emptyset) = 0, \\
P_{x,y}(X = 50, Y = -40) &= P(\{2,4,6\} \cap \{4\}) = P(\{4\}) = \tfrac{1}{6}, \\
P_{x,y}(X = -50, Y = -30) &= P(E^c \cap \{3\}) = P(\{3\}) = \tfrac{1}{6}, \\
P_{x,y}(X = 50, Y = -30) &= P(E \cap \{3\}) = P(\emptyset) = 0.
\end{aligned}$$

We may collect those numbers in a table:

            Y = -40    -30    -20    -10      0    100
X = -50         0      1/6     0     1/6    1/6     0
X =  50        1/6      0     1/6     0      0     1/6

Plugging in yields

$$E(XY) = \frac{1}{6}\,\big[-50 \cdot 40 + 50 \cdot 30 - 50 \cdot 20 + 50 \cdot 10 + 50 \cdot 0 + 50 \cdot 100\big] = 666.67.$$

Therefore one obtains for the correlation coefficient, apart from rounding errors,

$$\rho_{xy} = 0.286.$$
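The moments computed in this solution can be checked numerically. The sketch below assumes a fair die and a mapping of the six outcomes to (X, Y) that is inferred from the joint probabilities above; only the assignments for the outcomes 3 and 4 are stated explicitly in the text, the remaining ones are an assumption that is consistent with the table.

```python
import numpy as np

# Assumed mapping: die outcome -> (X, Y), inferred from the joint distribution above.
outcomes = {1: (-50, -10), 2: (50, -20), 3: (-50, -30),
            4: (50, -40), 5: (-50, 0), 6: (50, 100)}
x = np.array([v[0] for v in outcomes.values()], dtype=float)
y = np.array([v[1] for v in outcomes.values()], dtype=float)
p = np.full(6, 1 / 6)                       # fair die: each outcome has probability 1/6

mean_x, mean_y = p @ x, p @ y               # expected values
var_x = p @ (x - mean_x) ** 2               # variances
var_y = p @ (y - mean_y) ** 2
cov_xy = p @ ((x - mean_x) * (y - mean_y))  # covariance
rho = cov_xy / np.sqrt(var_x * var_y)       # correlation coefficient

print(mean_x, mean_y)
print(var_x, round(var_y, 2), round(cov_xy, 2), round(rho, 3))
```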

2.6 It only remains to be shown that:

$$E(|Y||Z|) \le \sqrt{E(Y^2)}\,\sqrt{E(Z^2)}.$$

In order to see that we use the binomial formula and obtain

$$\left(\frac{|Y|}{\sqrt{E(Y^2)}} - \frac{|Z|}{\sqrt{E(Z^2)}}\right)^2 = \frac{Y^2}{E(Y^2)} - \frac{2\,|Y||Z|}{\sqrt{E(Y^2)}\,\sqrt{E(Z^2)}} + \frac{Z^2}{E(Z^2)} \ge 0.$$

Therefore, the expectation of the left hand side cannot become negative, which
yields:

$$1 - \frac{2\,E(|Y||Z|)}{\sqrt{E(Y^2)}\,\sqrt{E(Z^2)}} + 1 = 2\left(1 - \frac{E(|Y||Z|)}{\sqrt{E(Y^2)}\,\sqrt{E(Z^2)}}\right) \ge 0.$$

In particular, it can be observed that the expression is always positive except for the
case Y = Z. Rearranging terms verifies the second inequality from (2.8).

2.7 Plugging in X − E(X) and Y − E(Y) instead of Y and Z in (2.5), by
Cauchy-Schwarz it follows that

$$\big|E[(X - E(X))(Y - E(Y))]\big| \le \sqrt{E[(X - E(X))^2]}\,\sqrt{E[(Y - E(Y))^2]},$$

which is the same as:

$$|\mathrm{Cov}(X,Y)| \le \sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}.$$

This verifies the claim.

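2.8 Due to

$$F_{x,y}(x,y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{x,y}(r,s)\,dr\,ds,$$

$f_{x,y}$ is determined by taking the partial derivative of $F_{x,y}$ with respect to both
arguments:

$$\begin{aligned}
\frac{\partial^2 F_{x,y}(x,y)}{\partial x\,\partial y} &= \frac{\partial \left[(1 + e^{-x} + e^{-y})^{-2}\, e^{-x}\right]}{\partial y} \\
&= \frac{2\,e^{-x} e^{-y}}{(1 + e^{-x} + e^{-y})^3} \\
&= f_{x,y}(x,y).
\end{aligned}$$

The marginal distribution of Y is determined by

$$\begin{aligned}
F_y(y) &= \int_{-\infty}^{y} \int_{-\infty}^{\infty} f_{x,y}(x,s)\,dx\,ds \\
&= \lim_{x \to \infty} F_{x,y}(x,y) = (1 + e^{-y})^{-1}.
\end{aligned}$$

The marginal density therefore reads

$$f_y(y) = \frac{e^{-y}}{(1 + e^{-y})^2}.$$

Division yields the conditional density:

$$f_{x|y}(x) = \frac{f_{x,y}(x,y)}{f_y(y)} = \frac{2\,e^{-x}\,(1 + e^{-y})^2}{(1 + e^{-x} + e^{-y})^3}.$$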


2.9  The  following  sequence  of  equalities  holds  and  will  be  justified  in  detail.  The 
first  two  equations  define  exactly  the  corresponding  (conditional)  expectations.  For 
the  third  equality,  the  order  of  integration  is  reversed;  this  is  due  to  Fubini’s  theorem. 
The  fourth  equation  is  again  by  definition  (conditional  density),  whereas  in  the  fifth 
equation  only  the  density  of  Y  is  cancelled  out.  In  the  sixth  equation,  the  influence  of 
Y  on  the  joint  density  is  integrated  out  such  that  the  marginal  density  of  X  remains. 
This  again  yields  the  expectation  of  X  by  definition.  Therefore,  it  holds  that 


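$$\begin{aligned}
E_y\big(E_x(X \mid Y)\big) &= \int_{-\infty}^{\infty} E_x(X \mid y)\, f_y(y)\,dy \\
&= \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} x\, f_{x|y}(x)\,dx\right] f_y(y)\,dy \\
&= \int_{-\infty}^{\infty} x \left[\int_{-\infty}^{\infty} f_{x|y}(x)\, f_y(y)\,dy\right] dx \\
&= \int_{-\infty}^{\infty} x \left[\int_{-\infty}^{\infty} \frac{f_{x,y}(x,y)}{f_y(y)}\, f_y(y)\,dy\right] dx \\
&= \int_{-\infty}^{\infty} x \left[\int_{-\infty}^{\infty} f_{x,y}(x,y)\,dy\right] dx \\
&= \int_{-\infty}^{\infty} x\, f_x(x)\,dx \\
&= E_x(X),
\end{aligned}$$

which was to be verified.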

2.10 We use statement (a), $E(x_t \mid x_{t-h}) = 0$ for $h > 0$, connected with the law of
iterated expectations:

$$E(x_t) = E\big[E(x_t \mid x_{t-h})\big] = E(0) = 0.$$

This proves (b), that martingale differences are also unconditionally zero on average.
By applying both results of Proposition 2.1 for $h > 0$ again with (a), one arrives at:

$$\begin{aligned}
E(x_t\, x_{t+h}) &= E\big[E(x_t\, x_{t+h} \mid x_t)\big] \\
&= E\big[x_t\, E(x_{t+h} \mid x_t)\big] \\
&= E[x_t \cdot 0] \\
&= 0.
\end{aligned}$$

Therefore, $\mathrm{Cov}(x_t, x_{t+h}) = 0$ for $h > 0$. However, as the covariance function is
symmetric in h, the result holds for arbitrary $h \ne 0$, which was to be verified to
show (c).

Autoregressive Moving Average Processes
(ARMA)


3.1  Summary 

This chapter is concerned with the modelling of serial correlation (or autocorrelation)
that is characteristic of many time series dynamics. To that end we cover a class
of stochastic processes widely used in practice. They are discrete-time processes:
$\{x_t\}_{t \in T}$ with $T \subseteq \mathbb{Z}$. Throughout this chapter, the innovations or shocks $\{\varepsilon_t\}$ behind
$\{x_t\}$ are assumed to form a white noise sequence as defined in Example 2.6. The
next section treats the rather simple moving average structure. The third section
addresses the inversion of lag polynomials at a general level. The fourth section
applies these technical results to ARMA processes.


3.2  Moving  Average  Processes 

We define the moving average process of order q (MA(q)) as

$$x_t = \mu + b_0\,\varepsilon_t + b_1\,\varepsilon_{t-1} + \cdots + b_q\,\varepsilon_{t-q}, \quad b_0 = 1, \quad t \in T. \qquad (3.1)$$

In the following, we assume a white noise process for the so-called innovations $\{\varepsilon_t\}$
governing the processes considered. In general, we assume $b_q \ne 0$. Let us consider
a special case before we enter the general discussion of the model.


MA(1) 

We set $\mu$ to zero and q to one in (3.1) and thereby obtain

$$x_t = \varepsilon_t + b\,\varepsilon_{t-1}.$$

As the innovations are WN(0, $\sigma^2$), the moments are independent of time: Firstly, it
holds that $E(x_t) = \mu = 0$. Secondly, one obtains from

$$\gamma(h) = \mathrm{Cov}(x_t, x_{t+h}) = E\big((\varepsilon_t + b\,\varepsilon_{t-1})\,(\varepsilon_{t+h} + b\,\varepsilon_{t+h-1})\big)$$

immediately

$$\gamma(0) = \sigma^2 (1 + b^2), \quad \gamma(1) = \sigma^2 b,$$

and

$$\gamma(h) = 0 \quad \text{for } h > 1.$$

For the MA(1) process considered, the autocorrelation function $\rho(h) = \gamma(h)/\gamma(0)$
therefore is:

$$\rho(1) = \frac{b}{1 + b^2}, \quad \rho(h) = 0, \quad h > 1.$$

This is why the process is always (weakly) stationary without any assumptions with
respect to b or to the index set T. Elementary curve sketching shows that $\rho(1)$
considered with respect to b,

$$\rho(1;b) = \frac{b}{1 + b^2},$$

becomes extremal for $b = \pm 1$. For $b = -1$, $\rho(1)$ takes the minimum $-\tfrac{1}{2}$ and for
$b = 1$ it takes the maximum $\tfrac{1}{2}$ (see Problem 3.1).

In Fig. 3.1 moving average processes with 50 observations each were simulated
by means of so-called pseudo random numbers. All three graphs
were generated from the same realizations of $\{\varepsilon_t\}$. The first graph with $b_1 = b = 0$
shows the simulation of a pure random process $\{\varepsilon_t\}$: The time series oscillates
arbitrarily around the expected value zero. The third graph with $b_1 = b = 0.9$
depicts the case of (strong) positive autocorrelation of first order: Positive values tend
to be followed by positive values whereas negative values tend to entail negative values,
i.e. the zero line is crossed less often than in the first case. Finally, the graph in the
middle lies in between both extreme cases as weak positive autocorrelation ($b_1 =
b = 0.3$) is present.
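The experiment behind such a figure can be sketched as follows. This is a minimal illustration and not the code used for Fig. 3.1: the Gaussian distribution of the innovations, the seed, and the larger sample size (chosen so that the comparison of zero crossings is stable) are assumptions.

```python
import numpy as np

def simulate_ma1(eps, b):
    """Return x_t = eps_t + b * eps_{t-1} for t = 1, ..., n, given eps_0, ..., eps_n."""
    return eps[1:] + b * eps[:-1]

def zero_crossings(x):
    """Count sign changes between consecutive observations."""
    return int(np.sum(np.sign(x[:-1]) != np.sign(x[1:])))

rng = np.random.default_rng(1)        # arbitrary seed
eps = rng.normal(size=5001)           # Gaussian white noise with variance 1 (assumption)
for b in (0.0, 0.3, 0.9):             # same innovations for all three processes
    x = simulate_ma1(eps, b)
    rho1_hat = np.corrcoef(x[:-1], x[1:])[0, 1]
    print(f"b={b}: rho(1)={b / (1 + b**2):.3f}, "
          f"sample rho(1)={rho1_hat:.3f}, zero crossings={zero_crossings(x)}")
```

The number of zero crossings should decrease as b grows, in line with the verbal description of the three graphs.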


MA(q) 

The results obtained for the MA(1) process can be generalized. Every MA process
is always stationary, for all parameter values and independently of starting value
conditions. For the MA(q) process, it holds that its autocorrelation sequence vanishes
for lags greater than q (i.e. it becomes zero). The proof is elementary and is therefore
omitted.

Fig. 3.1 Simulated MA(1) processes with $\sigma^2 = 1$ and $\mu = 0$


Proposition 3.1 (MA(q)) Assume an MA(q) process from (3.1).

(a) The process is stationary with expectation $\mu$.

(b) For the autocovariances it holds that

$$\gamma(h) = \sigma^2\,(b_h + b_{h+1} b_1 + \cdots + b_q b_{q-h}), \quad h = 0, 1, \ldots, q,$$

and $\gamma(h) = 0$ for $h > q$.

Example 3.1 (Seasonal MA) Let S denote the seasonality, e.g. S = 4 or S = 12 for
quarterly or monthly data, or S = 5 for (work) daily observations. Hence, we define
as a special MA process

$$x_t = \varepsilon_t + b\,\varepsilon_{t-S}.$$

1 For S = 1 we obtain the MA(1) case, however, without a seasonal interpretation.

In this context, $q = S$ and $b_q = b$ hold and

$$b_1 = b_2 = \cdots = b_{q-1} = 0.$$

Proposition 3.1 (b) yields:

$$\gamma(0) = \sigma^2\,(b_0^2 + b_1^2 + \cdots + b_q^2) = \sigma^2 (1 + b^2),$$

$$\gamma(1) = \sigma^2\,(b_1 + b_2 b_1 + \cdots + b_q b_{q-1}) = 0,$$

and also

$$\gamma(h) = 0 \quad \text{for } h = 1, 2, \ldots, S-1.$$

The case $h = S$, however, yields

$$\gamma(S) = \sigma^2\, b_S\, b_0 = \sigma^2 b.$$

For $h > S$, according to Proposition 3.1 it holds that $\gamma(h) = 0$. Hence, the process at
hand is exclusively autocorrelated at lag S, which is why it is also called a seasonal
MA process. ■
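The autocovariance formula of Proposition 3.1 (b) is easy to evaluate numerically. The following sketch writes it as a sum over products of coefficients that are h positions apart; the function name and the parameter values (S = 4, b = 0.5, variance 2) are illustrative assumptions.

```python
import numpy as np

def ma_autocovariance(b, sigma2=1.0):
    """Autocovariances gamma(0), ..., gamma(q) of an MA(q) with b = [b_0, ..., b_q]."""
    b = np.asarray(b, dtype=float)
    q = len(b) - 1
    # gamma(h) = sigma^2 * sum_{j=0}^{q-h} b_j * b_{j+h}
    return np.array([sigma2 * np.dot(b[: q - h + 1], b[h:]) for h in range(q + 1)])

# Seasonal MA with S = 4: only b_0 = 1 and b_4 = b = 0.5 are non-zero.
gamma = ma_autocovariance([1.0, 0.0, 0.0, 0.0, 0.5], sigma2=2.0)
print(gamma)   # gamma(0) = sigma^2 (1 + b^2), gamma(4) = sigma^2 b, zeros in between
```

For lags beyond q the autocovariance is zero by the proposition, which is why the function only returns the lags 0 to q.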


MA(oo)  Processes 

We now let q go off to infinity. For reasons that become obvious in the fourth section,
the MA coefficients, however, are not denoted by $b_j$ anymore. Instead, consider the
infinite real sequence $\{c_j\}$, $j \in \mathbb{N}$, to define:

$$x_t = \mu + \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j}, \quad \sum_{j=0}^{\infty} |c_j| < \infty, \quad c_0 = 1. \qquad (3.2)$$

Sometimes the process from (3.2) is called "causal" as there are only past or
contemporaneous random variables $\varepsilon_{t-j}$, $j \ge 0$, entering the process at time t. The
condition on the coefficients $\{c_j\}$ of being absolutely summable guarantees that
$\sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j}$ in (3.2) is a well-defined random variable, which then can be called
$x_t$, see e.g. Fuller (1996, Theorem 2.2.1). Absolute summability naturally implies
square summability,

$$\sum_{j=0}^{\infty} c_j^2 < \infty,$$

and, indeed, sometimes we define the MA($\infty$) process upon this weaker assumption.
Square summability of $\{c_j\}$ is sufficient for stationarity with $E(x_t) = \mu$ and

$$\gamma(h) = \sigma^2 \sum_{j=0}^{\infty} c_j\, c_{j+h},$$

see Fuller (1996, Theorem 2.2.3), which is the first result of the following proposition.
The second one is established in Problem 3.3.

Proposition 3.2 (Infinite MA) Assume an MA($\infty$) process,

$$x_t = \mu + \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j}, \quad \{\varepsilon_t\} \sim WN(0, \sigma^2), \quad c_0 = 1,$$

with $\sum_{j=0}^{\infty} c_j^2 < \infty$.

(a) The process is stationary with expected value $\mu$, and for the autocovariances it
holds that

$$\gamma(h) = \sigma^2 \sum_{j=0}^{\infty} c_j\, c_{j+h}, \quad h = 0, 1, \ldots.$$

(b) Under absolute summability, $\sum_{j=0}^{\infty} |c_j| < \infty$, the sequence of autocovariances
is absolutely summable:

$$\sum_{h=0}^{\infty} |\gamma(h)| < \infty.$$


The fact that the sequence of autocovariances is absolutely summable under (3.2),
see (b) of the proposition, has an immediate implication: the autocovariances tend
to zero with growing lag:

$$\gamma(h) \to 0 \quad \text{as } h \to \infty.$$

This means that the correlation between $x_t$ and $x_{t-h}$ tends to decrease with growing
lag h.

Note that $x_t$ from (3.2) or Proposition 3.2 is defined as a linear combination of
$\varepsilon_{t-j}$, $j \ge 0$. Therefore, one sometimes speaks of a linear process. Other authors
reserve this label for the more restricted case of iid innovations, where all temporal
dependence of $\{x_t\}$ arises exclusively from the MA coefficients $\{c_j\}$. The results of
Proposition 3.2, however, hold under the weaker assumption that $\{\varepsilon_t\}$ is white noise.

Impulse Responses

The MA($\infty$) coefficients are often called impulse responses, since they measure
the effect of a shock j periods ago on $x_t$:

$$\frac{\partial x_t}{\partial \varepsilon_{t-j}} = c_j.$$

Square summability implies that shocks are transient in that $c_j \to 0$ as $j \to \infty$. The
speed at which the impulse responses converge to zero characterizes the dynamics.
In particular, Campbell and Mankiw (1987) popularized the cumulated impulse
responses as a measure of persistence, where the cumulated effect is defined as

$$CIR(J) = \sum_{j=0}^{J} c_j.$$

This measure quantifies the total effect up to J periods back, if there occurred a unit
shock in each past period, including the present period at time t. Asymptotically,
one obtains the so-called long-run effect, often called total multiplier in economics.
This measure is defined as

$$CIR := \lim_{J \to \infty} CIR(J), \qquad (3.3)$$

provided  that  this  quantity  exists.  Clearly,  under  absolute  summability  of  the 
impulse  responses,  CIR  is  well  defined.  Andrews  and  Chen  (1994)  advocated  CIR 
as  being  superior  to  alternative  measures  of  persistence.  We  will  return  to  the 
measurement  and  interpretation  of  persistence  in  the  next  chapter. 

The MA($\infty$) process is of considerable generality. In fact, Wold (1938) showed
that every stationary process with expected value zero can be decomposed into a
square summable, purely non-deterministic MA($\infty$) component and an uncorrelated
component, say $\{\delta_t\}$, which is deterministic in the sense that it is perfectly
predictable from its past:2

$$x_t = \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j} + \delta_t, \quad \sum_{j=0}^{\infty} c_j^2 < \infty, \quad E(\varepsilon_{t-j}\,\delta_t) = 0, \qquad (3.4)$$

where $E(\varepsilon_t) = E(\delta_t) = 0$. For details on this Wold decomposition see Fuller (1996,
Theorem 2.10.2) or Brockwell and Davis (1991, Theorem 5.7.1).

2 Typically, one assumes that $\delta_t$ is identically equal to zero for all t, since "perfectly predictable"
only allows for trivial processes like e.g. $\delta_t = (-1)^t A$ or $\delta_t = A$, where A is some random variable,
such that $\delta_t = -\delta_{t-1}$ or $\delta_t = \delta_{t-1}$, respectively. Of course, this does not rule out the case of a
constant mean $\mu$ different from zero as assumed in (3.2).

In  practice  it  is  not  reasonable  to  construct  a  model  with  infinitely  many 
parameters  cj  as  they  cannot  be  estimated  from  a  finite  number  of  observations 
without  further  restrictions.  In  the  fourth  section  of  this  chapter,  however,  we  will 
learn  that  very  simple,  so-called  autoregressive  processes  with  a  finite  number  of 
parameters  possess  an  MA(oo)  representation.  In  order  to  discuss  autoregressive 
processes  rigorously,  we  need  to  concern  ourselves  with  polynomials  in  the  lag 
operator  and  their  invertibility. 


3.3  Lag  Polynomials  and  Invertibility 

Frequently,  time  series  models  are  written  by  means  of  the  lag  operator3  L.  The 
lag operator shifts the process $\{x_t\}$ by one unit in time: $L x_t = x_{t-1}$. By the inverse
operator $L^{-1}$ the shift is just reversed, $L^{-1} L x_t = x_t$, or $L^{-1} x_t = x_{t+1}$. Successive use
of the operator is denoted by its power, $L^j x_t = x_{t-j}$, $j \in \mathbb{Z}$. The identity is described
with $L^0$, $L^0 x_t = x_t$. Applied to a constant c, the operator leaves the value unchanged,
$Lc = c$.


Causal  Linear  Filters 

Let us consider an input process $\{x_t\}$, $t \in T$, which is transferred into an output
process $\{y_t\}$ by linear filtering,

$$y_t = \sum_{j=0}^{p} w_j\, x_{t-j}, \qquad (3.5)$$

with the real filter coefficients $\{w_j\}$ (which need not add up to one). Therefore, $y_t$
is a linear combination of values of $x_{t-j}$ where we assume constant filters, i.e. the
weights do not depend on t. In particular, the filter is called causal as $y_t$ is defined
by past and contemporaneous values of $\{x_t\}$ only.

The general linear causal filter from (3.5) can be formulated as a polynomial in
the lag operator:

$$F(L) = \sum_{j=0}^{p} w_j\, L^j \quad \text{with} \quad y_t = F(L)\, x_t.$$

3 Many authors also speak of the backshift operator and write B.

Occasionally, two filters are put in a row:

$$y_t = F_1(L)\, x_t, \quad z_t = F_2(L)\, y_t = F_2(L)\, F_1(L)\, x_t,$$

$$F_1(L) = \sum_{j=0}^{p_1} w_{1,j}\, L^j, \quad F_2(L) = \sum_{j=0}^{p_2} w_{2,j}\, L^j.$$

Then, the filter F(L) transforming $\{x_t\}$ into $\{z_t\}$ is defined by multiplying the filters,
$F(L) = F_2(L)\, F_1(L)$, which is called a "convolution":

$$F_2(L)\, F_1(L) = \sum_{k=0}^{p_1 + p_2} v_k\, L^k, \quad v_k = \sum_{j=0}^{k} w_{2,j}\, w_{1,k-j} = \sum_{j=0}^{k} w_{2,k-j}\, w_{1,j},$$

where coefficients with an index beyond the respective filter order are set to zero.

This convolution is commutative:

$$F_1(L)\, F_2(L) = F_2(L)\, F_1(L).$$

By expressing filters by means of the lag operator, we can manipulate them just
as ordinary (complex-valued) polynomials. As an example we consider so-called
difference filters.

Example 3.2 (Difference Filters) By means of the lag operator, filters can be
constructed, for example the difference filter $\Delta = 1 - L$ or the difference of the
previous year for quarterly data $\Delta_4 = 1 - L^4$:

$$\Delta x_t = x_t - x_{t-1}, \quad \Delta_4 x_t = x_t - x_{t-4}.$$

The seasonal difference filters for monthly observations (S = 12) as well as for daily
observations (S = 5) are defined analogously:

$$\Delta_S = 1 - L^S.$$

Instead of extensively calculating the double difference,

$$\begin{aligned}
\Delta(\Delta x_t) &= \Delta(x_t - x_{t-1}) = (x_t - x_{t-1}) - L(x_t - x_{t-1}) \\
&= (x_t - x_{t-1}) - (x_{t-1} - x_{t-2}) = x_t - 2x_{t-1} + x_{t-2},
\end{aligned}$$

we write in short by expanding $(1 - L)^2$:

$$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2)\, x_t = x_t - 2x_{t-1} + x_{t-2}.$$

While the ordinary difference operator $\Delta$ eliminates a linear time trend from a time
series, the second differences naturally remove a quadratic time trend:

$$\Delta^2 (a + bt + ct^2) = c\,\big(t^2 - 2(t-1)^2 + (t-2)^2\big) = 2c.$$

The fact that the order of filtering is exchangeable is well demonstrated by means
of the example of seasonal and ordinary differences:

$$\begin{aligned}
\Delta \Delta_S &= (1 - L)(1 - L^S) = 1 - L - L^S + L^{S+1} \\
&= (1 - L^S)(1 - L) = \Delta_S \Delta. \quad ■
\end{aligned}$$
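The filter algebra of this example can be reproduced numerically, since multiplying lag polynomials amounts to convolving their coefficient vectors. The following is a sketch under illustrative assumptions: the helper `apply_filter`, the trend coefficients and the choice S = 4 are not taken from the text.

```python
import numpy as np

def apply_filter(w, x):
    """Causal filter y_t = sum_j w[j] * x_{t-j}; the first len(w) - 1 values are dropped."""
    return np.convolve(x, w, mode="valid")

delta = np.array([1.0, -1.0])                    # Delta   = 1 - L
delta_s = np.array([1.0, 0.0, 0.0, 0.0, -1.0])   # Delta_4 = 1 - L^4
delta2 = np.convolve(delta, delta)               # (1 - L)^2 = 1 - 2L + L^2
print(delta2)

t = np.arange(20, dtype=float)
a, b, c = 3.0, -1.5, 0.25                        # hypothetical trend coefficients
trend = a + b * t + c * t**2
print(apply_filter(delta2, trend))               # second differences: constant 2c

# The order of filtering does not matter: both products give 1 - L - L^4 + L^5.
print(np.convolve(delta, delta_s), np.convolve(delta_s, delta))
```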

Invertibility  of  Lag  Polynomials 

We define as a polynomial of degree p (also of order p) in the lag operator

$$P(L) = 1 + b_1 L + \cdots + b_p L^p, \quad b_p \ne 0, \qquad (3.6)$$

with the real coefficients $b_1$ to $b_p$. For brevity, P(L) is also called lag polynomial.
Consider a first degree polynomial as a special case of (3.6),

$$A_1(L) = 1 - aL,$$

where the reason for the negative sign will immediately be obvious. When and how
can this polynomial be inverted? A comparison of coefficients ("method of undetermined
coefficients") results in the following series expansion (see Problem 3.4):

$$(1 - aL)^{-1} = \frac{1}{A_1(L)} = \sum_{j=0}^{\infty} a^j L^j.$$

As is well known, it holds that (infinite geometric series, see Problem 3.2)

$$\sum_{j=0}^{\infty} a^j = \frac{1}{1 - a} < \infty \iff |a| < 1. \qquad (3.7)$$

Hence, it holds that

$$(1 - aL)^{-1} = \frac{1}{A_1(L)} = \sum_{j=0}^{\infty} a^j L^j \quad \text{with} \quad \sum_{j=0}^{\infty} |a^j| = \frac{1}{1 - |a|} < \infty$$

if and only if $|a| < 1$. This condition of invertibility is frequently reformulated. In
order to do this, we determine the so-called z-transform of the lag polynomial with
z being an element of the complex numbers ($z \in \mathbb{C}$): $A_1(z) = 1 - az$. Now, $|a| < 1$
implies that the z-transform $A_1(z) = 1 - az$ exhibits only roots outside the unit
circle, i.e. roots being greater than one in absolute value:

$$A_1(z) = 0 \implies |z| = \frac{1}{|a|} > 1. \qquad (3.8)$$
Causally  Invertible  Polynomials 


This condition of invertibility for $A_1(L)$ from (3.8) is easily conveyed to a polynomial
P(L) of the order p. We say P(L) is causally invertible if there exists a power
series expansion with non-negative powers and absolutely summable coefficients:

$$(P(L))^{-1} = \frac{1}{P(L)} = \sum_{j=0}^{\infty} \alpha_j L^j \quad \text{with} \quad \sum_{j=0}^{\infty} |\alpha_j| < \infty.$$

The invertibility depends on the z-transform

$$P(z) = 1 + b_1 z + \cdots + b_p z^p, \quad z \in \mathbb{C}, \qquad (3.9)$$

or rather on the absolute value of its roots. The following condition of invertibility
is adopted from Brockwell and Davis (1991, Thm. 3.1.1), and it is discussed as an
exercise (Problem 3.5).

Proposition 3.3

(a) The polynomial $1 - aL$ is causally invertible,

$$\frac{1}{1 - aL} = \sum_{j=0}^{\infty} a^j L^j \quad \text{with} \quad \sum_{j=0}^{\infty} |a^j| < \infty,$$

if and only if $|a| < 1$.

(b) The polynomial P(L) from (3.6) is causally invertible, i.e. for $(P(L))^{-1}$ there
exists the absolutely summable series expansion,

$$(P(L))^{-1} = \sum_{j=0}^{\infty} \alpha_j L^j \quad \text{with} \quad \sum_{j=0}^{\infty} |\alpha_j| < \infty,$$

if and only if it holds for all roots of P(z) that they are greater than one in
absolute value:

$$P(z) = 0 \implies |z| > 1. \qquad (3.10)$$

Example 3.3 (Invertible MA Processes) The MA(1) process, $x_t = \varepsilon_t + b\,\varepsilon_{t-1}$, can
now be formulated alternatively by applying the MA(1) polynomial $B(L) = 1 + bL$.
Although not required for stationarity, it is usually assumed that $|b| < 1$. What for?
This implies for the MA polynomial that:

$$B(z) = 0 \implies |z| = \frac{1}{|b|} > 1.$$

According to Proposition 3.3 the condition of invertibility is fulfilled and there exists

$$\frac{1}{B(L)} = \frac{1}{1 + bL} = \sum_{j=0}^{\infty} \alpha_j L^j,$$

where the coefficients $\{\alpha_j\}$ are absolutely summable. The $\{\alpha_j\}$ are obtained explicitly
by comparison of coefficients in

$$\begin{aligned}
1 &= (1 + bL) \sum_{j=0}^{\infty} \alpha_j L^j \\
&= \alpha_0 + \alpha_1 L + \alpha_2 L^2 + \alpha_3 L^3 + \cdots \\
&\quad + b\,(\alpha_0 L + \alpha_1 L^2 + \alpha_2 L^3 + \cdots),
\end{aligned}$$

yielding

$$1 = \alpha_0, \quad 0 = \alpha_1 + b\,\alpha_0, \quad 0 = \alpha_2 + b\,\alpha_1, \quad \text{etc.},$$

or

$$\alpha_0 = 1, \quad \alpha_1 = -b, \quad \alpha_2 = b^2, \quad \text{and} \quad \alpha_j = (-1)^j b^j, \quad j \ge 0.$$

Hence, the MA(1) process $x_t = B(L)\,\varepsilon_t$ can be reformulated as follows:

$$\varepsilon_t = \frac{x_t}{1 + bL} = x_t - b\,x_{t-1} + b^2 x_{t-2} - b^3 x_{t-3} \pm \cdots,$$

or

$$\begin{aligned}
x_t &= b\,x_{t-1} - b^2 x_{t-2} + b^3 x_{t-3} \mp \cdots + \varepsilon_t \\
&= -\sum_{j=1}^{\infty} \alpha_j\, x_{t-j} + \varepsilon_t, \quad \sum_{j=0}^{\infty} |\alpha_j| < \infty.
\end{aligned}$$

In other words: The invertible MA(1) process (for $|b| < 1$) can be expressed (in
an absolutely summable manner) as a process depending on its own infinitely many
lags. Such a process is called autoregressive (of infinite order). ■
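The comparison of coefficients used in this example extends to any lag polynomial of the form (3.6) and can be carried out recursively. The sketch below is one possible implementation; the function name, the truncation at a finite number of terms and the value b = 0.5 are illustrative assumptions.

```python
import numpy as np

def invert_lag_polynomial(b, n_terms):
    """First n_terms coefficients alpha_j of 1/P(L), P(L) = 1 + b[0] L + ... + b[p-1] L^p."""
    b = np.asarray(b, dtype=float)
    alpha = np.zeros(n_terms)
    alpha[0] = 1.0
    for j in range(1, n_terms):
        # The coefficient of L^j in P(L) * sum_k alpha_k L^k has to vanish:
        # alpha_j + b_1 alpha_{j-1} + ... + b_p alpha_{j-p} = 0
        k = min(j, len(b))
        alpha[j] = -np.dot(b[:k], alpha[j - k:j][::-1])
    return alpha

b = 0.5                                     # hypothetical MA(1) coefficient with |b| < 1
alpha = invert_lag_polynomial([b], 8)
print(alpha)                                # recursion
print((-b) ** np.arange(8))                 # closed form (-1)^j b^j from the example
```

Multiplying the truncated expansion by 1 + bL returns the identity up to the truncation order, which provides a simple consistency check.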

Frequently, one is interested in the special case of a quadratic polynomial, $p = 2$. In this case, the so-called Schur criterion provides an equivalent reformulation of the condition of invertibility, rephrasing (3.10) in terms of the polynomial coefficients $b_i$ directly. These have to fulfill three conditions simultaneously. We take the corresponding statement from e.g. Sydsæter, Strøm, and Berck (1999, p. 58).

Corollary 3.1 For $p = 2$ the polynomial from (3.6),

$$P(L) = 1 + b_1 L + b_2 L^2 ,$$

is causally invertible with absolutely summable series expansion $(P(L))^{-1}$ if and only if:

(i) $1 - b_2 > 0$,
and (ii) $1 - b_1 + b_2 > 0$,
and (iii) $1 + b_1 + b_2 > 0$.

Instead of checking $|z_{1,2}| > 1$ for

$$z_{1,2} = \frac{-b_1 \pm \sqrt{b_1^2 - 4 b_2}}{2 b_2} ,$$

it may in practice be simpler to check (i) through (iii) from Corollary 3.1.
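The equivalence between the three coefficient conditions and the root condition can be checked numerically. The sketch below assumes $b_2 \neq 0$ (so that the polynomial is truly quadratic); the function names are hypothetical.

```python
import cmath


def schur_invertible(b1, b2):
    """Conditions (i)-(iii) of Corollary 3.1 for P(L) = 1 + b1 L + b2 L^2."""
    return (1 - b2 > 0) and (1 - b1 + b2 > 0) and (1 + b1 + b2 > 0)


def roots_outside_unit_circle(b1, b2):
    """Direct check of |z| > 1 for both roots of 1 + b1 z + b2 z^2 (assumes b2 != 0)."""
    disc = cmath.sqrt(b1 * b1 - 4 * b2)  # complex square root handles negative discriminants
    z1 = (-b1 + disc) / (2 * b2)
    z2 = (-b1 - disc) / (2 * b2)
    return abs(z1) > 1 and abs(z2) > 1


if __name__ == "__main__":
    for b1, b2 in [(-0.7, -0.1), (1.0, 0.8), (0.5, 1.5), (2.4, 0.75)]:
        print(b1, b2, schur_invertible(b1, b2), roots_outside_unit_circle(b1, b2))
```

Both checks agree away from the boundary of the admissible region; on the boundary itself (a root exactly on the unit circle) floating-point rounding can make the root-based check unreliable, which is one practical argument for the coefficient conditions.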


3.4  Autoregressive  and  Mixed  Processes 

Let $\{x_t\}$ be given by the following stochastic difference equation,

$$x_t = \nu + a_1 x_{t-1} + \cdots + a_p x_{t-p} + \varepsilon_t , \quad a_p \neq 0 , \quad t \in T ,$$

defining an autoregressive process of order $p$, AR($p$). The properties of the general AR process can be illustrated well by the example $p = 1$.


AR(1)

In particular, let $p = 1$:

$$x_t = \nu + a\,x_{t-1} + \varepsilon_t , \quad t \in T . \tag{3.11}$$

When replacing $x_{t-1}$ by this defining equation, one obtains:

$$x_t = \nu + a\,(\nu + a\,x_{t-2} + \varepsilon_{t-1}) + \varepsilon_t = \nu + a\,\nu + a^2 x_{t-2} + a\,\varepsilon_{t-1} + \varepsilon_t .$$

Hence, $x_t$ and $x_{t-2}$ prove to be correlated. Continued substitution yields

$$x_t = \nu + a\,\nu + a^2 \nu + a^3 x_{t-3} + a^2 \varepsilon_{t-2} + a\,\varepsilon_{t-1} + \varepsilon_t ,$$

or for any $h > 0$:

$$x_t = (1 + a + \ldots + a^{h-1})\,\nu + a^h x_{t-h} + a^{h-1} \varepsilon_{t-h+1} + \cdots + a\,\varepsilon_{t-1} + \varepsilon_t .$$


Now, let us suppose that the index set $T$ does not have a lower bound at zero but includes an infinite past; then $h$ can be arbitrarily large and the substitution can be repeated ad infinitum. If it furthermore holds that

$$|a| < 1 ,$$

then the geometric series yields ($h \to \infty$)

$$1 + a + \ldots + a^{h-1} = \frac{1 - a^h}{1 - a} \to \frac{1}{1 - a} ,$$

and $a^h x_{t-h} \to 0$ in a sense that can be made rigorous, see Brockwell and Davis (1991, p. 71) or Fuller (1996, p. 39). In this manner, it follows for $h \to \infty$ under the aforementioned conditions that:

$$x_t = \frac{\nu}{1 - a} + 0 + \sum_{j=0}^{\infty} a^j \varepsilon_{t-j} .$$

This way one obtains an infinite MA representation with geometrically decaying coefficients, $c_j = a^j$ in (3.2). In fact, this representation can formally be obtained by inverting $1 - aL$, see Proposition 3.3 (a). The process is therefore stationary with (see Proposition 3.2)

$$E(x_t) = \mu = \frac{\nu}{1 - a} , \qquad Var(x_t) = \sigma^2 \sum_{j=0}^{\infty} a^{2j} = \frac{\sigma^2}{1 - a^2} ,$$

and

$$\gamma(h) = \sigma^2 \sum_{j=0}^{\infty} a^j a^{j+h} = a^h\,\gamma(0) .$$

It follows

$$\rho(h) = a^h .$$
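As a plausibility check of the AR(1) moments, the autocovariance can be computed from a truncated MA($\infty$) representation with coefficients $a^j$ and compared with the closed form $a^h \sigma^2 / (1 - a^2)$. This is a minimal sketch; the truncation length and the function names are assumptions of this illustration.

```python
def ar1_autocov_from_ma(a, sigma2, h, n_terms=2000):
    """gamma(h) = sigma^2 * sum_j c_j c_{j+h} with c_j = a**j, truncated after n_terms."""
    return sigma2 * sum(a ** j * a ** (j + h) for j in range(n_terms))


def ar1_autocov_closed_form(a, sigma2, h):
    """gamma(h) = a**h * sigma^2 / (1 - a^2), valid for |a| < 1 and h >= 0."""
    return a ** h * sigma2 / (1 - a * a)


if __name__ == "__main__":
    for a in (0.75, -0.75):
        for h in range(4):
            print(a, h, ar1_autocov_from_ma(a, 1.0, h), ar1_autocov_closed_form(a, 1.0, h))
```

Dividing either expression by its value at $h = 0$ gives the geometric decay of the autocorrelations, with alternating signs for negative $a$.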


Note that these results are obtained for $|a| < 1$. Furthermore, stationarity also depends on the index set. That is to say, if it holds that $T = \{0, 1, 2, \ldots\}$ then the above-mentioned repeated substitution cannot be performed infinitely often. From

$$x_t = (1 + a + \ldots + a^{t-1})\,\nu + a^t x_0 + a^{t-1} \varepsilon_1 + \cdots + a\,\varepsilon_{t-1} + \varepsilon_t$$

the expected value follows as time-dependent:

$$E(x_t) = (1 + a + \ldots + a^{t-1})\,\nu + a^t\,E(x_0) , \quad t \in \{0, 1, \ldots\} .$$

In particular, this example shows that the stationarity behavior of a process can depend on the index set $T$. Therefore, in general a stochastic process is not completely characterized without specifying $T$.
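The time dependence of the mean for a process started at $t = 0$ can be made concrete with a few lines of code. The sketch below iterates the recursion $E(x_t) = \nu + a\,E(x_{t-1})$ and compares it with the closed-form expression; the starting mean and the parameter values are arbitrary assumptions.

```python
def mean_path(a, nu, mean_x0, t_max):
    """Iterate E(x_t) = nu + a * E(x_{t-1}) starting from E(x_0) = mean_x0."""
    means = [mean_x0]
    for _ in range(t_max):
        means.append(nu + a * means[-1])
    return means


def mean_closed_form(a, nu, mean_x0, t):
    """E(x_t) = (1 + a + ... + a^(t-1)) * nu + a^t * E(x_0)."""
    return sum(a ** i for i in range(t)) * nu + a ** t * mean_x0


if __name__ == "__main__":
    path = mean_path(a=0.5, nu=1.0, mean_x0=0.0, t_max=10)
    print(path)  # approaches nu / (1 - a) = 2 without reaching it at any finite t
```

Only if $E(x_0)$ already equals $\nu/(1-a)$ is the mean constant from the start; otherwise it merely converges to that value when $|a| < 1$.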

Example 3.4 (AR(1)) Figure 3.2 displays 50 realizations each of AR(1) processes obtained by simulation. However, in the first case $a_1 = a = 0$ such that the graph depicts the realizations of $\{\varepsilon_t\}$. On the right the theoretical autocorrelogram, $\rho(h) = 0$ for $h > 0$, is shown. In the second panel, the case of a positive autocorrelation ($a_1 = a = 0.75$) is illustrated: Positive values tend to be followed by positive values (and vice versa for negative values), such that phases of positive realizations tend to alternate with phases of negative values. The corresponding autocorrelogram shows the geometrically decaying positive autocorrelations up to the order 10. In the last case, there is a negative autocorrelation ($a_1 = a = -0.75$); consistently, negative values tend to be followed by positive ones and vice versa positive values tend to be followed by negative ones. Therefore, the zero line is more often crossed than in the first case of no serial correlation. The corresponding autocorrelogram is alternating: $\rho(h) = |a|^h (-1)^h$ for $a = -0.75$. Qualitatively different patterns of autocorrelation cannot be generated by the simple AR(1) model. Note that the impulse responses or MA($\infty$) coefficients of the AR(1) model are $c_j = a^j$. Consequently, the cumulated effect defined in (3.3) becomes for $|a| < 1$:

$$CIR = \sum_{j=0}^{\infty} a^j = \frac{1}{1 - a} .$$

The larger $a$, the larger is $CIR$, which hence quantifies the persistence described above for positive $a$. ■

Fig. 3.2 Simulated AR(1) processes with $\sigma^2 = 1$ and $\nu = 0$

We have seen that the stationarity of an AR(1) process depends essentially on the absolute value of $a$. For the Markov property, however, this value is irrelevant. We talk about a Markov process if the entire past information $X_t$ up to time $t$ is concentrated in the last observation $x_t$:

$$P(x_{t+s} \leq x \mid X_t) = P(x_{t+s} \leq x \mid x_t)$$

for all $s > 0$ and $x \in \mathbb{R}$. Assuming a normal distribution, we show in Problem 3.7 that every AR(1) process is a Markov process.


AR(p)

In general, the AR($p$) process can be formulated equivalently by means of a lag polynomial:

$$A(L)\,x_t = \nu + \varepsilon_t , \quad A(L) = 1 - a_1 L - \cdots - a_p L^p , \quad t \in T . \tag{3.12}$$
Under the condition

$$A(z) = 0 \;\Rightarrow\; |z| > 1$$

the autoregressive polynomial $A(L)$ is causally invertible with absolutely summable coefficients according to Proposition 3.3:

$$\frac{1}{A(L)} = \sum_{j=0}^{\infty} \alpha_j L^j , \qquad \sum_{j=0}^{\infty} |\alpha_j| < \infty .$$

Therefore, under the condition of invertibility it holds that

$$x_t = \frac{\nu + \varepsilon_t}{A(L)} = \frac{\nu}{A(1)} + \sum_{j=0}^{\infty} \alpha_j \varepsilon_{t-j} ,$$

and $\{x_t\}$ is a stationary MA($\infty$) process, see Proposition 3.2. Hence, the autocovariance sequence is absolutely summable. Furthermore, in this case it holds for $h > 0$ (w.l.o.g.4 we set $\nu = 0$ for simplification),

$$\begin{aligned}
\gamma(h) &= Cov(x_t, x_{t+h}) \\
&= E(x_t\,x_{t+h}) \\
&= E\big(x_t\,(a_1 x_{t+h-1} + \ldots + a_p x_{t+h-p} + \varepsilon_{t+h})\big) \\
&= a_1 \gamma(h-1) + \ldots + a_p \gamma(h-p) + 0 .
\end{aligned}$$

Dividing by $\gamma(0)$ yields the recursive relation from the subsequent proposition: The autocorrelations are given by a deterministic difference equation of order $p$. The still missing necessary condition of stationarity from Proposition 3.4 ($A(1) > 0$) will be derived in Problem 3.6.

Proposition 3.4 (AR(p)) Let $\{x_t\}$ be an AR($p$) process from (3.12) with index set5 $T = \{-\infty, \ldots, T\}$.


4The abbreviation stands for "without loss of generality". It is frequently used for assumptions that are substantially not necessary and that are only made to simplify the argument or the notation. In the example at hand, generally it would have to be written $(x_{t-j} - \mu)$ for all $j$; just as well, one can set $\mu = \nu/A(1)$ equal to zero and simply write $x_{t-j}$.

5Also in the following, the notation $\{-\infty, \ldots, T\}$ is always to denote the set of all integers without $\{T+1, T+2, \ldots\}$.
(a) The process has an absolutely summable MA($\infty$) representation according to (3.2) if and only if it holds that

$$A(z) = 0 \;\Rightarrow\; |z| > 1 .$$

Then the process is stationary with expectation $\mu = \nu/A(1)$. The condition $A(1) = 1 - \sum_{j=1}^{p} a_j > 0$ is necessary for this.

(b)  For  stationary  processes  the  autocorrelation  sequence  is  absolutely  summable 
where  it  holds  that  p(h)  0  for  all  integer  numbers  h  and 


$$\rho(h) = a_1 \rho(h-1) + \ldots + a_p \rho(h-p) , \quad h > 0 .$$

Again note that the absolute summability of $\rho(h)$ implies: $\rho(h) \to 0$ for $h \to \infty$. The farther $x_t$ and $x_{t+h}$ are apart from each other, the weaker their correlation tends to be.

Certain properties of the AR(1) process are lost for $p > 1$. In Problem 3.8 we show for a special case ($p = 2$) that the AR($p$) process, $p > 1$, is not a Markov process in general.


AR(2) 


Let $p = 2$,

$$x_t = \nu + a_1 x_{t-1} + a_2 x_{t-2} + \varepsilon_t ,$$

with the autoregressive polynomial

$$A(L) = 1 - a_1 L - a_2 L^2 .$$

From Corollary 3.1, we know the conditions under which $(A(L))^{-1}$ can be expanded as an absolutely summable filter:

(i) $1 + a_2 > 0$,
and (ii) $1 + a_1 - a_2 > 0$,
and (iii) $1 - a_1 - a_2 > 0$.

Consequently, under these three parameter restrictions the AR(2) process is stationary. The restrictions become even more obvious if they are solved for $a_2$ and depicted in a coordinate system:

(i) $a_2 > -1$,
and (ii) $a_2 < 1 + a_1$,
and (iii) $a_2 < 1 - a_1$.
Fig. 3.3 Stationarity triangle for AR(2) processes (axis label: a1)


So, three lines are given, above or below which $a_2$ has to be, respectively. This is why one also talks about the stability or stationarity triangle, see Fig. 3.3. Within the triangle lies the stationarity region.

The autocorrelation series of the AR(2) case is determined from Proposition 3.4. For $h = 0$, it naturally holds that $\rho(0) = 1$. For $h = 1$, one obtains

$$\rho(1) = a_1 \rho(0) + a_2 \rho(-1) .$$

Because of the symmetry, $\rho(-h) = \rho(h)$, it follows that

$$\rho(1) = \frac{a_1}{1 - a_2} .$$

Similarly, it follows that

$$\rho(2) = a_1 \rho(1) + a_2 \rho(0) = \frac{a_1^2}{1 - a_2} + a_2 .$$

By repeated insertion into the second order difference equation,

$$\rho(h) = a_1 \rho(h-1) + a_2 \rho(h-2) , \quad h \geq 2 ,$$

the entire autocorrelation sequence is determined. Next, four numerical examples will be considered.
Fig. 3.4 Autocorrelograms for AR(2) processes (panels: a1 = 0.7, a2 = 0.1; a1 = -0.4, a2 = 0.4; a1 = -1.0, a2 = -0.8; a1 = 1.0, a2 = -0.5)

Example 3.5 (AR(2)) We now consider four numerical examples for AR(2) processes in order to characterize typical patterns of autocorrelation. In all cases, the stability conditions can be proven to be fulfilled. The corresponding autocorrelograms can be found in Fig. 3.4.

(i) $a_1 = 0.7$, $a_2 = 0.1$: In this case, all the autocorrelations are positive and they converge to zero with $h$; their behavior is similar to the autocorrelogram of an AR(1) process with $a_1 > 0$, see Fig. 3.2.

(ii) $a_1 = -0.4$, $a_2 = 0.4$: Starting with $\rho(1) < 0$, the autocorrelations alternate similarly to an AR(1) process with $a_1 < 0$, cf. Fig. 3.2.

(iii) $a_1 = -1.0$, $a_2 = -0.8$: In this case, we find a dynamic which cannot be generated by an AR(1) model; two negative autocorrelations of the first and second order are followed by a seemingly irregularly alternating pattern.

(iv) $a_1 = 1.0$, $a_2 = -0.5$: In the last case, the autocorrelations swing from the positive area to the negative area, then to the positive one and again to the negative one, whereas the last-mentioned can hardly be perceived because of the small absolute values.

Therefore, the cases (iii) and (iv) show that the AR(2) process allows for richer dynamics and dependence structures than the simpler AR(1) process. ■
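The four AR(2) cases can be reproduced by iterating the second order difference equation for the autocorrelations, started with $\rho(0) = 1$ and $\rho(1) = a_1/(1 - a_2)$. The following sketch also checks the three conditions of the stationarity triangle; the function names are hypothetical.

```python
def ar2_is_stationary(a1, a2):
    """Stationarity triangle: 1 + a2 > 0, 1 + a1 - a2 > 0, 1 - a1 - a2 > 0."""
    return (1 + a2 > 0) and (1 + a1 - a2 > 0) and (1 - a1 - a2 > 0)


def ar2_autocorrelations(a1, a2, max_lag):
    """rho(0), ..., rho(max_lag) from rho(h) = a1 rho(h-1) + a2 rho(h-2), h >= 2."""
    rho = [1.0, a1 / (1.0 - a2)]
    for h in range(2, max_lag + 1):
        rho.append(a1 * rho[h - 1] + a2 * rho[h - 2])
    return rho[: max_lag + 1]


if __name__ == "__main__":
    cases = [(0.7, 0.1), (-0.4, 0.4), (-1.0, -0.8), (1.0, -0.5)]
    for a1, a2 in cases:
        rho = ar2_autocorrelations(a1, a2, 8)
        print(a1, a2, ar2_is_stationary(a1, a2), [round(r, 3) for r in rho])
```

Printing the first few lags makes the qualitative patterns visible, for instance the two negative values at lags one and two in case (iii).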


Autoregressive  Moving  Average  Processes 

Now, we consider a combination of AR and MA processes. Again, $\{x_t\}$ is given by a stochastic difference equation of order $p$, only that $\varepsilon_t$ from (3.12) is replaced by an MA($q$) process:

$$x_t = \nu + a_1 x_{t-1} + \cdots + a_p x_{t-p} + \varepsilon_t + b_1 \varepsilon_{t-1} + \cdots + b_q \varepsilon_{t-q} , \quad t \in T .$$

Abbreviating, we talk about ARMA($p, q$) processes assuming $a_p \neq 0$ and $b_q \neq 0$. Again, a more compact representation follows by using lag polynomials,

$$A(L)\,x_t = \nu + B(L)\,\varepsilon_t , \quad t \in T , \tag{3.13}$$

where it is assumed that both polynomials

$$A(L) = 1 - a_1 L - \cdots - a_p L^p \quad \text{and} \quad B(L) = 1 + b_1 L + \cdots + b_q L^q$$

do not have common roots.

A stationary MA($\infty$) representation hinges on the autoregressive polynomial such that the stationarity condition can be adopted from the pure AR($p$) case in Proposition 3.4. It amounts to an absolutely summable expansion of $(A(L))^{-1}$ such that the process possesses an absolutely summable representation as MA($\infty$) process, see (3.2). If stationarity is given, the absolutely summable autocorrelation sequence can again be determined from a stable difference equation (see e.g. Brockwell & Davis, 1991, p. 93).

Proposition 3.5 (ARMA(p, q)) Let $\{x_t\}$ be an ARMA($p, q$) process from (3.13) with $T = \{-\infty, \ldots, T\}$.

(a) The process has an absolutely summable MA($\infty$) representation according to (3.2) if and only if it holds that

$$A(z) = 0 \;\Rightarrow\; |z| > 1 .$$

Then the process is stationary with expectation $\mu = \nu/A(1)$. The condition $A(1) = 1 - \sum_{j=1}^{p} a_j > 0$ is necessary for this.

(b)  For  stationary  processes  the  autocorrelation  sequence  is  absolutely  summable 
where  it  holds  that  p(h)  0  for  all  integer  numbers  h  and 


$$\rho(h) = a_1 \rho(h-1) + \ldots + a_p \rho(h-p) , \quad h \geq \max(p, q+1) .$$

For $h \geq \max(p, q+1)$ the autocorrelations satisfy the same difference equation as in the pure AR($p$) case. It is known that the solution to such a difference equation is bounded exponentially. Consequently, we have the following result, see e.g. Brockwell and Davis (1991, Prob. 3.11): For a stationary ARMA process there exist positive constants $c$ and $g$ with $0 < g < 1$, such that

$$|\rho(h)| \leq c\,g^h \tag{3.14}$$

for all $h = 1, 2, \ldots$. Hence, the decay rate of the autocorrelations is bounded exponentially. This shows again that stationary ARMA processes are characterized by absolutely summable autocorrelations, see Problem 3.2. Proposition 3.5 will be discussed below for $p = q = 1$.

Before turning to the ARMA(1,1) case, we note that an analogous result to Proposition 3.5 (a) is available for an AR($\infty$) representation according to Proposition 3.3: The ARMA process has an absolutely summable AR($\infty$) representation (see Example 3.3) if and only if

$$B(z) = 0 \;\Rightarrow\; |z| > 1 .$$

In this case the ARMA process is called invertible.


ARMA(1,1) 


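We now wish to obtain the autocorrelation structure of the ARMA(1,1) process,

$$x_t = a\,x_{t-1} + \varepsilon_t + b\,\varepsilon_{t-1} , \quad |a| < 1 , \; |b| < 1 ,$$

where $|a| < 1$ for stationarity, and $|b| < 1$ to ensure invertibility (see Example 3.3). The condition of no common roots of $1 - aL$ and $1 + bL$ is given for $a \neq -b$. In the case of common roots, the lag polynomials could be reduced and one would obtain

$$x_t = \frac{1 + bL}{1 + bL}\,\varepsilon_t = \varepsilon_t \quad \text{for } a = -b .$$

Due to the invertibility of $1 - aL$, the process can be formulated as an infinite MA process,

$$\begin{aligned}
x_t &= \frac{\varepsilon_t + b\,\varepsilon_{t-1}}{1 - aL} \\
&= \sum_{j=0}^{\infty} a^j \varepsilon_{t-j} + b \sum_{j=0}^{\infty} a^j \varepsilon_{t-1-j} \\
&= \varepsilon_t + \sum_{j=1}^{\infty} \left( a^j + b\,a^{j-1} \right) \varepsilon_{t-j} ,
\end{aligned}$$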
where a shift of subscripts was carried out. In the notation of (3.2) the MA($\infty$) coefficients result as

$$c_j = a^{j-1} (a + b) , \quad j \geq 1 .$$

In this way Proposition 3.2 yields for the variance:

$$\gamma(0) = \sigma^2 \left( 1 + \sum_{j=1}^{\infty} a^{2j-2} (a + b)^2 \right) = \sigma^2\,\frac{1 + b^2 + 2ab}{1 - a^2} .$$

The autocovariance at lag one follows in the same way,

$$\gamma(1) = \sigma^2\,\frac{(a + b)(1 + ab)}{1 - a^2} ,$$

such that it holds:

$$\rho(1) = \frac{(a + b)(1 + ab)}{1 + b^2 + 2ab} .$$

Furthermore we learn from Proposition 3.5:

$$\rho(h) = a\,\rho(h-1) , \quad h \geq 2 .$$

Hence, the MA(1) component (that is $b$) influences directly only $\rho(1)$, having only an indirect effect beyond the autocorrelation of the first order: For $h \geq 2$ a recursive relation between $\rho(h)$ and $\rho(h-1)$ holds true just as it applies to the pure AR(1) process. Thus, this yields four typical patterns. In order to identify these, it suffices entirely to concentrate on the numerator of $\rho(1)$ as well as on the sign of $a$, as the denominator of $\rho(1)$ is always positive because it is a multiple of the variance. As $1 + ab$ is positive due to the stationarity and the invertibility of the MA polynomial, the behavior of the autocorrelogram depends on the signs of $a + b$ and $a$ only. The exponential bound for the autocorrelations is easily verified:

$$\rho(h) = a\,\rho(h-1) = \cdots = a^{h-1} \rho(1) , \quad h \geq 2 .$$

Therefore, (3.14) applies with $g = |a|$ and $c = |\rho(1)|/|a|$.
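The closed-form first order autocorrelation of the ARMA(1,1) process and the recursion for higher lags can be cross-checked against a direct computation from truncated MA($\infty$) coefficients $c_0 = 1$, $c_j = a^{j-1}(a+b)$. This is a minimal sketch under the assumption that a truncation after a few thousand terms is accurate enough for $|a| < 1$; all function names are hypothetical.

```python
def arma11_rho1(a, b):
    """rho(1) = (a + b)(1 + ab) / (1 + b^2 + 2ab)."""
    return (a + b) * (1 + a * b) / (1 + b * b + 2 * a * b)


def arma11_autocorrelations(a, b, max_lag):
    """rho(0), ..., rho(max_lag) using rho(h) = a * rho(h-1) for h >= 2."""
    rho = [1.0]
    if max_lag >= 1:
        rho.append(arma11_rho1(a, b))
    for _ in range(2, max_lag + 1):
        rho.append(a * rho[-1])
    return rho


def arma11_autocorr_from_ma(a, b, h, n_terms=2000):
    """rho(h) from truncated MA coefficients c_0 = 1, c_j = a^(j-1) (a + b)."""
    c = [1.0] + [a ** (j - 1) * (a + b) for j in range(1, n_terms + h)]
    gamma_h = sum(c[j] * c[j + h] for j in range(n_terms))
    gamma_0 = sum(cj * cj for cj in c[:n_terms])
    return gamma_h / gamma_0


if __name__ == "__main__":
    a, b = 0.75, -0.9
    for h, r in enumerate(arma11_autocorrelations(a, b, 5)):
        print(h, r, arma11_autocorr_from_ma(a, b, h))
```

With $g = |a|$ and $c = |\rho(1)|/|a|$ the exponential bound holds with equality for every $h \geq 1$, which the recursion makes easy to confirm.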


Example 3.6 (ARMA(1,1)) The four possible patterns of the autocorrelogram of a stationary and invertible ARMA(1,1) model will be discussed and illustrated by numerical examples, cf. Fig. 3.5.

Fig. 3.5 Autocorrelograms for ARMA(1,1) processes (panels: a1 = 0.75, b1 = 0.75; a1 = 0.75, b1 = -0.90; a1 = -0.5, b1 = 0.90; a1 = -0.50, b1 = -0.90)

Case 1: Obviously, for $a + b > 0$ it holds that $\rho(1) > 0$. If, furthermore, $a > 0$ then the entire autocorrelogram proceeds above the zero line.

Case 2: For $a + b < 0$ and $a > 0$ one obtains exclusively negative autocorrelations.

Case 3: An alternating pattern, starting with $\rho(1) > 0$, is obtained for $a < 0$ and $a > -b$.

Case 4: For $a < 0$ and $a < -b$ the autocorrelation series is alternating as well but starts with a negative value.

Note that cases 1 and 4 can be generated qualitatively by a pure AR(1) process as well. For cases 2 and 3, however, there occur patterns which cannot be produced by an AR(1) process. Therefore, the ARMA(1,1) process allows for richer dynamic modeling than the AR(1) model does. Also when comparing with the AR(2) case, we find that the ARMA(1,1) model allows for additional dynamics. ■



3  Autoregressive  Moving  Average  Processes  (ARMA) 


3.5  Problems  and  Solutions 


Problems 

3.1 Where does the first order autocorrelation of an MA(1) process ($\rho(1) = \frac{b}{1+b^2}$) have its maximum and minimum?

3.2 Show for $g \in \mathbb{R} \setminus \{1\}$ (geometric series):

$$\sum_{i=0}^{n} g^i = \frac{1 - g^{n+1}}{1 - g} .$$

Conclusion: For $|g| < 1$ and $n \to \infty$ it holds that:

$$\sum_{i=0}^{\infty} g^i = \frac{1}{1 - g} .$$

3.3 Prove part (b) from Proposition 3.2.

3.4 Derive the series expansion

$$(1 - aL)^{-1} = \sum_{j=0}^{\infty} a^j L^j$$

for real $a$ with $|a| < 1$.

3.5 Prove Proposition 3.3 (b).

3.6 Prove the necessary condition of causal invertibility from Proposition 3.4, that is:

$$\{A(z) = 0 \;\Rightarrow\; |z| > 1\} \;\Rightarrow\; \{A(1) > 0\} ,$$

where $A(z) = 1 - a_1 z - \ldots - a_p z^p$.

3.7 Let $\{\varepsilon_t\} \sim WN(0, \sigma^2)$ be a Gaussian process. Show that $\{x_t\}$ with $x_t = a_1 x_{t-1} + \varepsilon_t$ is a Markov process.

3.8 Let $\{\varepsilon_t\} \sim WN(0, \sigma^2)$. Show that $\{x_t\}$ with $x_t = a_2 x_{t-2} + \varepsilon_t$ is not a Markov process ($a_2 \neq 0$).
Solutions


3.1 The traditional way to solve the problem is curve sketching. We consider the first order autocorrelation to be a function of $b$:

$$f(b) = \rho(1) = \frac{b}{1 + b^2} .$$

Then the quotient rule for the first order derivative yields

$$f'(b) = \frac{1 + b^2 - 2b^2}{(1 + b^2)^2} = \frac{(1 - b)(1 + b)}{(1 + b^2)^2} .$$

The roots of the derivative are given by $|b| = 1$. In $b = -1$ there is a change of sign of $f'(b)$, namely from a negative to a positive slope. Hence, in $b = -1$ there is a relative (and also an absolute) minimum. Because of $f(b)$ being an odd function (symmetric about the origin), there is a maximum in $b = 1$. Therefore, the maximum possible correlation in absolute value is

$$|f(-1)| = f(1) = \frac{1}{2} .$$

One may also tackle the problem by more elementary means. Note that for $b \neq 0$

$$0 \leq (1 - |b|)^2 = 1 - 2|b| + b^2 ,$$

which is equivalent to

$$|f(b)| = \frac{|b|}{1 + b^2} \leq \frac{1}{2} ,$$

with $f(b) = \rho(1)$ defined above. Since $|f(-1)| = f(1) = \frac{1}{2}$, this solves the problem.
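The location of the extrema of the first order autocorrelation of an MA(1) process can also be confirmed by a crude grid search. The grid range and step size below are arbitrary assumptions made for this sketch, and the function name is hypothetical.

```python
def rho1_ma1(b):
    """First order autocorrelation of an MA(1) process: b / (1 + b^2)."""
    return b / (1 + b * b)


# Evaluate on an evenly spaced grid over [-5, 5] with step 0.001 (assumed range).
grid = [i / 1000 for i in range(-5000, 5001)]
b_max = max(grid, key=rho1_ma1)
b_min = min(grid, key=rho1_ma1)

if __name__ == "__main__":
    print(b_max, rho1_ma1(b_max))  # maximiser and maximum on the grid
    print(b_min, rho1_ma1(b_min))  # minimiser and minimum on the grid
```

A grid search does not replace the analytical argument, but it is a quick way to catch sign errors in a derivative.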

3.2 By $S_n$ we denote the following sum for finite $n$:

$$S_n = \sum_{i=0}^{n} g^i = 1 + g + \ldots + g^{n-1} + g^n .$$

Multiplication by $g$ yields

$$g\,S_n = g + g^2 + \ldots + g^n + g^{n+1} .$$

Therefore, it holds that

$$S_n - g\,S_n = 1 - g^{n+1} .$$

Factoring out $S_n$ on the left-hand side and dividing by $1 - g$ gives the formula

$$S_n = \frac{1 - g^{n+1}}{1 - g} ,$$

and therefore the claim is verified.

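3.3 The absolute summability of $\gamma(h)$ follows from the absolute summability of the linear coefficients $\{c_j\}$ allowing for a change of the order of summation. In order to do so, we first apply the triangle inequality:

$$\begin{aligned}
\frac{1}{\sigma^2} \sum_{h=0}^{\infty} |\gamma(h)| &= \sum_{h=0}^{\infty} \left| \sum_{j=0}^{\infty} c_j c_{j+h} \right| \\
&\leq \sum_{h=0}^{\infty} \sum_{j=0}^{\infty} |c_j c_{j+h}| \\
&= \sum_{h=0}^{\infty} \sum_{j=0}^{\infty} |c_j|\,|c_{j+h}| \\
&= \sum_{j=0}^{\infty} |c_j| \left( \sum_{h=0}^{\infty} |c_{j+h}| \right) ,
\end{aligned}$$

where at the end round brackets were placed for reasons of clarity. The final term is further bounded by enlarging the expression in brackets:

$$\sum_{j=0}^{\infty} |c_j| \left( \sum_{h=0}^{\infty} |c_{j+h}| \right) \leq \sum_{j=0}^{\infty} |c_j| \left( \sum_{h=0}^{\infty} |c_h| \right) = \left( \sum_{j=0}^{\infty} |c_j| \right)^2 < \infty .$$

Therefore, the claim follows indeed from the absolute summability of $\{c_j\}$.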
3.4 For the proof we denote $(1 - aL)^{-1}$ as

$$(1 - aL)^{-1} = \sum_{j=0}^{\infty} \alpha_j L^j$$

and determine the coefficients $\alpha_j$. By multiplying this equation with $1 - aL$, we obtain

$$\begin{aligned}
1 &= (1 - aL) \sum_{j=0}^{\infty} \alpha_j L^j \\
&= \alpha_0 + \alpha_1 L + \alpha_2 L^2 + \ldots \\
&\quad - a\,\alpha_0 L - a\,\alpha_1 L^2 - a\,\alpha_2 L^3 - \ldots
\end{aligned}$$

Now, we compare the coefficients associated with $L^j$ on the left- and on the right-hand side:

$$\begin{aligned}
1 &= \alpha_0 , \\
0 &= \alpha_1 - a\,\alpha_0 , \\
0 &= \alpha_2 - a\,\alpha_1 , \\
&\;\;\vdots \\
0 &= \alpha_j - a\,\alpha_{j-1} , \quad j \geq 1 .
\end{aligned}$$

As claimed, the solution of the difference equation obtained in this way, $\alpha_j = a\,\alpha_{j-1}$, is obviously $\alpha_j = a^j$.

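3.5 We factorize $P(z) = 1 + b_1 z + \ldots + b_p z^p$ with roots $z_1, \ldots, z_p$ of this polynomial (fundamental theorem of algebra):

$$P(z) = b_p (z - z_1) \cdots (z - z_p) .$$

From each bracket we factorize $-z_j$ out such that

$$P(z) = b_p (-1)^p z_1 \cdots z_p \left( 1 - \frac{z}{z_1} \right) \cdots \left( 1 - \frac{z}{z_p} \right) .$$

Because of $P(0) = 1$, we obtain $b_p (-1)^p z_1 \cdots z_p = 1$. Therefore the factorization simplifies to

$$P(z) = \left( 1 - \frac{z}{z_1} \right) \cdots \left( 1 - \frac{z}{z_p} \right) = P_1(z) \cdots P_p(z) ,$$

with

$$P_k(z) = 1 - \frac{z}{z_k} = 1 - \lambda_k z , \quad k = 1, \ldots, p ,$$

where $\lambda_k = 1/z_k$. From part (a) we know that

$$\frac{1}{P_k(L)} = \sum_{j=0}^{\infty} \lambda_k^j L^j \quad \text{with} \quad \sum_{j=0}^{\infty} |\lambda_k|^j < \infty$$

if and only if

$$|\lambda_k| < 1 , \quad \text{i.e.} \quad |z_k| > 1 .$$

Now, consider the convolution (sometimes called Cauchy product) for $k \neq \ell$:

$$\frac{1}{P_k(L)\,P_\ell(L)} = \sum_{j=0}^{\infty} \lambda_k^j L^j \sum_{j=0}^{\infty} \lambda_\ell^j L^j = \sum_{j=0}^{\infty} c_j L^j$$

with

$$c_j = \sum_{i=0}^{j} \lambda_k^i \lambda_\ell^{j-i} .$$

We have $\sum_{j=0}^{\infty} |c_j| < \infty$ if and only if both $P_k^{-1}(L)$ and $P_\ell^{-1}(L)$ are absolutely summable, which holds true if and only if

$$|z_k| > 1 \quad \text{and} \quad |z_\ell| > 1 .$$

Repeating this argument we obtain that

$$\frac{1}{P(L)} = \frac{1}{P_1(L) \cdots P_p(L)} = \sum_{j=0}^{\infty} \alpha_j L^j \quad \text{with} \quad \sum_{j=0}^{\infty} |\alpha_j| < \infty$$

if and only if (3.10) holds. Quod erat demonstrandum.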
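The convolution step in the proof can be illustrated for $p = 2$. The sketch below builds the truncated expansions of two first order factors, forms their Cauchy product, and verifies that multiplying the result by the original quadratic polynomial returns the identity up to the truncation order. The values $0.5$ and $-0.4$ for the inverse roots are arbitrary assumptions, and the function names are hypothetical.

```python
def inverse_factor_coefficients(lam, n_terms):
    """Coefficients lam**j of 1 / (1 - lam L), truncated after n_terms."""
    return [lam ** j for j in range(n_terms)]


def cauchy_product(u, v):
    """Coefficients c_j = sum_{i=0}^{j} u_i v_{j-i} up to the shorter input length."""
    n = min(len(u), len(v))
    return [sum(u[i] * v[j - i] for i in range(j + 1)) for j in range(n)]


def apply_polynomial(poly, series):
    """Coefficients of poly(L) * series(L), truncated to len(series)."""
    return [
        sum(poly[i] * series[j - i] for i in range(min(j, len(poly) - 1) + 1))
        for j in range(len(series))
    ]


if __name__ == "__main__":
    lam_k, lam_l = 0.5, -0.4  # inverse roots, both inside the unit circle
    inv = cauchy_product(
        inverse_factor_coefficients(lam_k, 20),
        inverse_factor_coefficients(lam_l, 20),
    )
    # P(L) = (1 - lam_k L)(1 - lam_l L) = 1 - (lam_k + lam_l) L + lam_k lam_l L^2
    poly = [1.0, -(lam_k + lam_l), lam_k * lam_l]
    print(apply_polynomial(poly, inv)[:5])  # close to [1, 0, 0, 0, 0]
```

Because the first coefficients of the product are exact regardless of the truncation length, the identity check holds for every retained lag, not only approximately for large lags.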

3.6 At first we reformulate the autoregressive polynomial $A(z) = 1 - a_1 z - \ldots - a_p z^p$ in its factorized form with roots $z_1, \ldots, z_p$ (again by the fundamental theorem of algebra):

$$A(z) = -a_p (z - z_1) \cdots (z - z_p) .$$

For $z = 1$ this amounts to

$$A(1) = -a_p (1 - z_1) \cdots (1 - z_p) . \tag{3.15}$$

Because of $A(0) = 1$ we obtain as well:

$$1 = -a_p (-1)^p z_1 \cdots z_p . \tag{3.16}$$

Now we proceed in two steps, treating the cases of complex and real roots separately.

(A) Complex roots: Note that for a root $z_1 \in \mathbb{C}$ it holds that the complex conjugate, $z_2 = \bar{z}_1$, is a root as well. Then calculating with complex numbers yields for the product

$$(1 - z_1)(1 - z_2) = (1 - z_1)(1 - \bar{z}_1) = (1 - z_1)\,\overline{(1 - z_1)} = |1 - z_1|^2 > 0 .$$

Hence, for $p > 2$, complex roots contribute positively to $A(1)$ in (3.15). If $p = 2$, the roots are only complex if $a_2 < 0$, since the discriminant is $a_1^2 + 4a_2$; hence, $A(1) > 0$ by (3.15).

(B) Since the effect of complex roots is positive, we now concentrate on real roots $z_i$, for which it holds that $|z_i| > 1$ by assumption. So, we assume without loss of generality that the polynomial has no complex roots, or that all complex roots have been factored out. Two sub-cases have to be distinguished. (1) Even degree: For an even $p$ we again distinguish between two cases. Case 1, $a_p > 0$: Because of (3.16) there has to be an odd number of negative roots and therefore there has to be an odd number of positive roots as well. For the latter it holds that $(1 - z_i) < 0$ while the former naturally fulfill $(1 - z_i) > 0$. Hence, as claimed, it follows from (3.15) that $A(1)$ is positive. Case 2, $a_p < 0$: In this case one argues quite analogously. Because of (3.16) there is an even number of positive and negative roots such that the requested claim follows from (3.15) as well. (2) Odd degree: For an odd $p$ one obtains the requested result as well by distinction of the two cases for the sign of $a_p$. We omit details.

Hence, the proof is complete.

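3.7 The normality of $\{\varepsilon_t\}$ implies a multivariate Gaussian distribution of

$$\begin{pmatrix} \varepsilon_{t+1} \\ \vdots \\ \varepsilon_{t+s} \end{pmatrix} \sim \mathcal{N}_s \left( \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix} , \sigma^2 I_s \right)$$

with the identity matrix $I_s$ of dimension $s$. The $s$-fold substitution yields

$$x_{t+s} = a_1^s x_t + a_1^{s-1} \varepsilon_{t+1} + \cdots + a_1 \varepsilon_{t+s-1} + \varepsilon_{t+s} = a_1^s x_t + \sum_{i=0}^{s-1} a_1^i \varepsilon_{t+s-i} .$$

The sum over the white noise process has the moments

$$E\left( \sum_{i=0}^{s-1} a_1^i \varepsilon_{t+s-i} \right) = 0 , \qquad Var\left( \sum_{i=0}^{s-1} a_1^i \varepsilon_{t+s-i} \right) = \sigma^2 \sum_{i=0}^{s-1} a_1^{2i} ,$$

and, furthermore, it is normally distributed:

$$\sum_{i=0}^{s-1} a_1^i \varepsilon_{t+s-i} \sim \mathcal{N} \left( 0 , \; \sigma^2 \sum_{i=0}^{s-1} a_1^{2i} \right) .$$

Hence, $x_{t+s}$ given $x_t$ follows a Gaussian distribution with the corresponding moments:

$$x_{t+s} \mid x_t \sim \mathcal{N} \left( a_1^s x_t , \; \sigma^2 \sum_{i=0}^{s-1} a_1^{2i} \right) .$$

As $x_{t+s}$ can be expressed as a function of $x_t$ and $\varepsilon_{t+1}, \ldots, \varepsilon_{t+s}$ alone, the further past of the process does not matter for the conditional distribution of $x_{t+s}$. Therefore, for the entire information $X_t$ up to time $t$ it holds that:

$$P(x_{t+s} \leq x \mid X_t) = P(x_{t+s} \leq x \mid x_t) .$$

Hence, the Markov property (2.9) has been shown. It holds independently of the concrete value of $a_1$.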

3.8 For

$$x_t = a_2 x_{t-2} + \varepsilon_t ,$$

we obtain for $s = 1$ the conditional expectations $E(x_{t+1} \mid X_t) = a_2 x_{t-1}$, and

$$E(x_{t+1} \mid x_t) = E(a_2 x_{t-1} + \varepsilon_{t+1} \mid x_t) = a_2\,E(x_{t-1} \mid x_t) ,$$

with $X_t = \sigma(x_t, x_{t-1}, \ldots, x_1)$. As the conditional expectations do not coincide, the conditional distributions are not the same. Hence, it generally holds that

$$P(x_{t+1} \leq x \mid x_t) \neq P(x_{t+1} \leq x \mid X_t) ,$$

which proves that $\{x_t\}$ is not a Markov process.
4  Spectra of Stationary Processes


4.1  Summary

Spectral analysis (or analysis in the frequency domain) aims at detecting cyclical movements in a time series. These may originate from seasonality, a trend component or from a business cycle. The theoretical spectrum of a stationary process is the quantity measuring how strongly cycles with a certain period, or frequency, account for total variance. Treatments of spectral analysis are typically formally demanding, requiring e.g. knowledge of complex numbers and Fourier transformations. In this textbook we have tried to present and derive the relevant results in a way that is less elegant but in return carries less mathematical burden. The next section provides the definitions and intuition behind spectral analysis. Section 4.3 is analytically more demanding, containing some general theory. This theory is exemplified with the discussion of spectra from particular ARMA processes, hence building on the previous chapter.


4.2  Definition  and  Interpretation 

In this chapter we assume the most general case considered previously, i.e. the infinite MA process that is only square summable, $\{x_t\}_{t \in T}$, $T \subseteq \mathbb{Z}$,

$$x_t = \mu + \sum_{j=0}^{\infty} c_j \varepsilon_{t-j} , \qquad \sum_{j=0}^{\infty} c_j^2 < \infty , \tag{4.1}$$

with $\{\varepsilon_t\} \sim WN(0, \sigma^2)$. The autocovariances,

$$\gamma(h) = Cov(x_t, x_{t+h}) = \gamma(-h) , \quad h \in \mathbb{Z} ,$$

are given in Proposition 3.2 (a). We do not assume that $\{c_j\}$ and hence $\{\gamma(h)\}$ are absolutely summable, simply because this will not hold under long memory treated in the next chapter. We wish to construct a function $f$ that allows to express the autocovariances as weighted cosine waves of different periodicity,2

$$\gamma(h) = \int_{-\pi}^{\pi} \cos(\lambda h)\,f(\lambda)\,d\lambda .$$
The  basic  ingredient  of  an  analysis  of  periodicity  is  the  cosine  cycle  whose 
properties  we  want  to  recall  as  an  introduction. 


Periodic  Cycles 

By $c_\lambda(t)$ we denote the cycle based on the cosine,3

$$c_\lambda(t) = \cos(\lambda t) , \quad t \in \mathbb{R} ,$$

where $\lambda$ with $\lambda > 0$ is called frequency. The frequency is inversely related to the period $P$,

$$\lambda = \frac{2\pi}{P} .$$

For $\lambda = 1$ one obtains the cosine function which is $2\pi$-periodic and even (symmetric about the ordinate):

$$c_1(t) = \cos(t) = \cos(t + 2\pi) = c_1(t + 2\pi) ,$$

$$c_1(-t) = \cos(-t) = \cos(t) = c_1(t) .$$

More generally, it holds with $P = 2\pi/\lambda$ that:

$$c_\lambda(t) = \cos(\lambda t) = \cos(\lambda t + 2\pi) = \cos(\lambda (t + P)) = c_\lambda(t + P) .$$

Therefore the cosine cycle $c_\lambda(t)$ with frequency $\lambda$ has the period $P = 2\pi/\lambda$. Of course, the symmetry of $c_1(t)$ carries over:

$$c_\lambda(t) = c_\lambda(-t) .$$


1The assumption of absolute summability underlies most textbooks when it comes to spectral analysis, see e.g. Hamilton (1994) or Fuller (1996).

2From Brockwell and Davis (1991, Coro. 4.3.1) in connection with Brockwell and Davis (Thm. 5.7.2) one knows that such an expression exists.

3Here, the so-called amplitude is equal to one ($|c_\lambda(t)| \leq 1$), and the phase shift is zero ($c_\lambda(0) = 1$).
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cosine  wave  (lambda  =  1  and  lambda  =  2) 


-6  -4  -2  0  2  4  6 

[" 2pi,  2pi] 


-6  -4  -2  0  2  4  6 

[— 2pi,  2pi] 

Fig.  4.1  Cosine  cycle  with  different  frequencies 

For $\lambda = 1$, $\lambda = 2$ and $\lambda = 0.5$ these properties are graphically illustrated in Fig. 4.1.
Finally, remember the derivative of the cosine,

$$\frac{d c_\lambda(t)}{dt} = c_\lambda'(t) = -\lambda \sin(\lambda t),$$

which we will use repeatedly.

Definition 

For convenience, we now rephrase the MA($\infty$) process in terms of the lag
polynomial $C(L)$ of infinite order,

$$x_t = \mu + C(L)\,\varepsilon_t \quad \text{with} \quad C(L) = \sum_{j=0}^{\infty} c_j L^j .$$

Next, we define the so-called power transfer function $T_C(\lambda)$ of this polynomial:⁴

$$T_C(\lambda) = \sum_{j=0}^{\infty} c_j^2 + 2 \sum_{h=1}^{\infty} \sum_{j=0}^{\infty} c_j c_{j+h} \cos(\lambda h), \quad \lambda \in \mathbb{R} \setminus \{\lambda^*\}. \tag{4.2}$$

Note that $T_C$ may not exist everywhere: there may be singularities at some frequency
$\lambda^*$ such that $T_C(\lambda)$ goes off to infinity as $\lambda \to \lambda^*$; but at least the power transfer
function is integrable. The key result in Proposition 4.1 (e) is from Brockwell
and Davis (1991, Coro. 4.3.1, Thm. 5.7.2); it will be proved explicitly in Problem 4.1
under the simplifying assumption of absolute summability. The first four statements
in the following proposition are rather straightforward and will be justified below.

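Proposition 4.1 (Spectrum) Define for $\{x_t\}$ from (4.1) the spectrum

$$f(\lambda) = T_C(\lambda)\, \frac{\sigma^2}{2\pi} .$$

It has the following properties:

(a) $f(-\lambda) = f(\lambda)$,

(b) $f(\lambda) = f(\lambda + 2\pi)$,

(c) $f(\lambda) \ge 0$,

(d) $f(\lambda)$ is continuous in $\lambda$ under absolute summability, $\sum_{j} |c_j| < \infty$,

(e) For all $h \in \mathbb{Z}$:

$$\gamma(h) = \int_{-\pi}^{\pi} f(\lambda) \cos(\lambda h)\, d\lambda = 2 \int_{0}^{\pi} f(\lambda) \cos(\lambda h)\, d\lambda .$$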

Substituting the autocovariance expression from Proposition 3.2 into (4.2), the
following representation of the spectrum exists:

$$f(\lambda) = \frac{\gamma(0)}{2\pi} + \frac{2}{2\pi} \sum_{h=1}^{\infty} \gamma(h) \cos(\lambda h) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma(h) \cos(\lambda h). \tag{4.3}$$

The symmetry of the spectrum in Proposition 4.1 (a) immediately follows from the
symmetry of the cosine function. From the periodicity of the cosine, (b) follows as
well. Both results jointly explain why the spectrum is normally considered on the
restricted domain $[0, \pi]$ only. Property (c) follows from the definition of the power
transfer function, see Footnote 6 below. Finally, the continuity of $f(\lambda)$ claimed in
(d) under absolute summability results from uniform convergence, see Fuller (1996,
Thm. 3.1.9).

We call the function $f$ (or $f_x$, if we want to emphasize that $\{x_t\}$ is the underlying
process) the spectrum of $\{x_t\}$. Frequently, one also talks about spectral density or
spectral density function, as $f$ is a non-negative function which could be standardized
in such a way that the area beneath it would be equal to one.

4 A more detailed and technical exposition is reserved for the next section. Our expression in (4.2)
can be derived from the expression in Brockwell and Davis (1991, eq. 5.7.9), which is given in
terms of complex numbers.


Interpretation 

The usual interpretation of the spectrum is based on Proposition 4.1. Result (e)
and (4.3) jointly show the spectrum and the autocovariance series to result from
each other. In a sense, spectrum and autocovariances are two sides of the same coin.
The spectrum can be determined from the autocovariances by definition and, having
the spectrum, Proposition 4.1 provides the autocovariances. The case $h = 0$ with

$$\mathrm{Var}(x_t) = \gamma(0) = 2 \int_{0}^{\pi} f(\lambda)\, d\lambda$$

is particularly interesting. This equation implies: The spectrum at $\lambda_0$ measures how
strongly the cycle with frequency $\lambda_0$ and therefore of period $P_0 = 2\pi/\lambda_0$ adds to
the variance of the process. If $f$ has a maximum at $\lambda_0$, then the dynamics of $\{x_t\}$
is dominated by the corresponding cycle or period; inversely, if the spectrum has a
minimum at $\lambda_0$, then the corresponding cycle is of less relevance for the behavior
of $\{x_t\}$ than all other cycles. For $\lambda \to 0$, period $P$ converges to infinity. A cycle with
an infinitely long period is interpreted as a trend or a long-run component. Hence,
$f(0)$ indicates how strongly the process is dominated by a trend component.

Frequently, the analysis of the autocovariance structure or the autocorrelation
structure of a process is called “analysis in the time domain” as $\gamma(h)$ measures the
direct temporal dependence between $x_t$ and $x_{t+h}$. Correspondingly, the spectral
analysis is often referred to as “analysis in the frequency domain”. Proposition 4.1
and the definition in (4.3) show how to move back and forth between time and
frequency domain.
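
The passage between the two domains can be sketched numerically. The following is a minimal illustration, not part of the original text: it evaluates (4.3) for a finite list of autocovariances and then recovers them through Proposition 4.1 (e) by a midpoint rule. The function names, the MA(1) example with $b = 0.7$, $\sigma^2 = 1$, and the number of grid points are assumptions made for this sketch.

```python
import math

def spectrum_from_autocov(gamma, lam):
    """Evaluate (4.3): f(lam) = [gamma(0) + 2 sum_{h>=1} gamma(h) cos(lam h)] / (2 pi).

    `gamma` is the list [gamma(0), gamma(1), ...]; autocovariances beyond the
    end of the list are taken to be zero (finite MA case, an assumption here).
    """
    total = gamma[0] + 2.0 * sum(
        g * math.cos(lam * h) for h, g in enumerate(gamma) if h >= 1
    )
    return total / (2.0 * math.pi)

def autocov_from_spectrum(f, h, n=20000):
    """Evaluate Proposition 4.1 (e): gamma(h) = 2 int_0^pi f(lam) cos(lam h) d lam,
    approximated by the midpoint rule with n subintervals."""
    step = math.pi / n
    return 2.0 * step * sum(
        f((k + 0.5) * step) * math.cos((k + 0.5) * step * h) for k in range(n)
    )

# Illustrative MA(1) with b = 0.7 and sigma^2 = 1 (values chosen for this sketch).
b = 0.7
gamma = [1.0 + b * b, b]  # gamma(0), gamma(1); gamma(h) = 0 for h > 1

def f(lam):
    return spectrum_from_autocov(gamma, lam)

recovered = [autocov_from_spectrum(f, h) for h in range(3)]
print(recovered)  # close to [1.49, 0.7, 0.0]
```

Going from `gamma` to `f` and back reproduces the original autocovariances up to numerical error, which is the "two sides of the same coin" statement in computational form.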


Examples 

Example 4.1 (White Noise) Let us consider the white noise process $x_t = \varepsilon_t$ being
free from serial correlation. By definition it immediately follows that the spectrum
is constant:

$$f_\varepsilon(\lambda) = \sigma^2/2\pi, \quad \lambda \in [0, \pi].$$

According to Proposition 4.1 all frequencies account equally strongly for the
variance of the process. Analogously to the perspective in optics that the “color”
white results if all frequencies are present equally strongly, serially uncorrelated
processes are also often called “white noise”. ■

Example 4.2 (Season) Let us consider the ordinary seasonal MA process from
Example 3.1,

$$x_t = \varepsilon_t + b\,\varepsilon_{t-S}$$

with

$$\gamma(0) = \sigma^2 (1 + b^2), \quad \gamma(S) = \sigma^2 b$$

and $\gamma(h) = 0$ otherwise. By definition we obtain for the spectrum from (4.3)

$$2\pi f(\lambda) = \gamma(0) + 2\gamma(S) \cos(\lambda S)$$

or

$$f(\lambda) = \left(1 + b^2 + 2b \cos(\lambda S)\right) \sigma^2/2\pi .$$

In Problem 4.2 we determine that there are extrema at

$$0, \ \frac{\pi}{S}, \ \frac{2\pi}{S}, \ \ldots, \ \frac{(S-1)\pi}{S}, \ \pi .$$

The corresponding values are

$$f(0) = f\!\left(\frac{2\pi}{S}\right) = \ldots = (1 + b)^2 \sigma^2/2\pi,$$

$$f\!\left(\frac{\pi}{S}\right) = f\!\left(\frac{3\pi}{S}\right) = \ldots = (1 - b)^2 \sigma^2/2\pi .$$

Depending on the sign of $b$, maxima and minima alternate.
In Fig. 4.2 we find two typical shapes of the spectrum of the seasonal
MA process for $S = 4$ (quarterly data) with $b = 0.7$ and $b = -0.5$. First, let
us interpret the case $b > 0$. There are maxima at the frequencies $0$, $\pi/2$ and $\pi$.
Corresponding cycles are of the period

$$P_0 = \frac{2\pi}{0} = \infty, \quad P_1 = \frac{2\pi}{\pi/2} = 4, \quad P_2 = \frac{2\pi}{\pi} = 2 .$$
5 The variance of the white noise is set to one, $\sigma^2 = 1$. This is also true for all spectra of this
chapter depicted in the following.

[Figure: two panels, titled "MA(4) with b=0.7 and b=-0.5" and "MA(1) with b=0.7 and b=-0.5"]

Fig. 4.2 Spectra ($2\pi f(\lambda)$) of the MA(S) process from Example 4.2


The trend is the first infinitely long “period”. The second cycle has the period $P_1 = 4$,
i.e. four quarters, which is why this is the annual cycle. The third cycle with
$P_2 = 2$ is the semi-annual cycle with only two quarters. These three cycles dominate
the process for $b > 0$. Inversely, for $b < 0$ it holds that these very cycles add
particularly little to the variance of the process. ■
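
As a small numerical companion to Example 4.2, the sketch below evaluates the seasonal MA spectrum at the extrema $0, \pi/S, \ldots, \pi$ for $S = 4$ and $b = 0.7$ with $\sigma^2 = 1$, and converts each frequency into its period $P = 2\pi/\lambda$. The function name and the chosen parameter values are assumptions for illustration only.

```python
import math

def seasonal_ma_spectrum(lam, b, S, sigma2=1.0):
    """f(lam) = (1 + b^2 + 2 b cos(lam S)) sigma^2 / (2 pi), as in Example 4.2."""
    return (1.0 + b * b + 2.0 * b * math.cos(lam * S)) * sigma2 / (2.0 * math.pi)

S, b = 4, 0.7
extrema = [k * math.pi / S for k in range(S + 1)]  # 0, pi/S, ..., pi

# Scaled spectrum 2*pi*f at each extremum: (1 + b)^2 and (1 - b)^2 alternate.
values = [2.0 * math.pi * seasonal_ma_spectrum(lam, b, S) for lam in extrema]

# Period belonging to each frequency; frequency zero is the trend.
periods = [math.inf if lam == 0 else 2.0 * math.pi / lam for lam in extrema]

for lam, v, p in zip(extrema, values, periods):
    print(f"lambda = {lam:.4f}  2*pi*f = {v:.4f}  period = {p:.4f}")
```

For $b > 0$ the larger value $(1+b)^2$ appears at $0$, $\pi/2$ and $\pi$, that is, at the trend, the annual cycle (four quarters) and the semi-annual cycle (two quarters).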


Example 4.3 (MA(1)) Specifically for $S = 1$ the seasonal MA process passes into
the MA(1) process. Accordingly, one obtains two extrema at zero and $\pi$:

$$f(0) = (1 + b)^2 \sigma^2/2\pi, \quad f(\pi) = (1 - b)^2 \sigma^2/2\pi .$$

In between the spectrum reads

$$f(\lambda) = \left(1 + b^2 + 2b \cos(\lambda)\right) \sigma^2/2\pi .$$

For $b = 0.7$ and $b = -0.5$, respectively, the spectra were calculated, see Fig. 4.2.
For $b < 0$ one spots the relative absence of a trend (frequency zero matters least)
while for $b > 0$ precisely the long-run component as a trend dominates the process.

[Figure: spectrum plotted on $[0, \pi]$, titled "maximum at frequency lambda = 0.73"]

Fig. 4.3 Spectrum ($2\pi f(\lambda)$) of business cycle with a period of 8.6 years


Example 4.4 (Business Cycle) The spectrum is not only used for modelling seasonal
patterns but also for determining the length of a typical business cycle.
Let us assume a process with annual observations having the spectrum depicted in
Fig. 4.3. The maximum is at $\lambda = 0.73$. How do we interpret this fact in terms of
content? The dominating frequency $\lambda = 0.73$ corresponds to a period of about
$2\pi/0.73 \approx 8.6$ (years). A frequency of this magnitude is often called “business cycle frequency”,
being interpreted as the frequency which corresponds to the business cycle. In fact,
Fig. 4.3 does not show an empirical spectrum. Rather, it shows the theoretical
spectrum of the AR(2) model whose autocorrelogram is depicted at the bottom right
of Fig. 3.4. The cycle, which can be seen in the autocorrelogram there, translates
into the spectral maximum from Fig. 4.3. ■


4.3  Filtered  Processes 

The ARMA process, or more generally the infinite MA process, has been defined as
filtered white noise. In order to systematically derive a formal expression for their
spectra, we start quite generally with the relation between input and output of a filter
in the frequency domain.


Filtered Processes

As in the previous chapter, we consider the causal, time-invariant, linear filter $F(L)$,

$$F(L) = \sum_{j=0}^{p} w_j L^j,$$

where $L$ again denotes the lag operator and $p = \infty$ is allowed for. The filter is
assumed to be absolutely summable, which trivially holds true for finite order $p$. Let
the process $\{x_t\}$ be generated by filtering of the stationary process $\{e_t\}$,

$$x_t = F(L)\, e_t .$$

Then, how does the corresponding spectrum of $\{x_t\}$ for a given spectrum $f_e$ of $\{e_t\}$
read? The answer is based on the power transfer function $T_F(\lambda)$ that we briefly
touched upon in the previous section:⁶

$$T_F(\lambda) = \sum_{j=0}^{\infty} w_j^2 + 2 \sum_{h=1}^{\infty} \sum_{j=0}^{\infty} w_j w_{j+h} \cos(\lambda h). \tag{4.4}$$

At first glance, this expression appears cumbersome. However, in the next section
we will see that for concrete ARMA processes it simplifies radically. If $F(L)$ is a
finite filter (i.e. with finite $p$), then the sums of $T_F(\lambda)$ are truncated accordingly,
see (4.8) in the next section. With $T_F(\lambda)$ the following proposition considerably
simplifies the calculation of theoretical spectra (for a proof of an even more general
result see Brockwell and Davis (1991, Thm. 4.4.1), while Fuller (1996, Thm. 4.3.1)
covers our case where $\{e_t\}$ has absolutely summable autocovariances).


6 The mathematically experienced reader will find the expression in (4.4) to be unnecessarily
complicated as the transformation $T_F(\lambda)$ can be written considerably more compactly by using
the exponential function in the complex space. It holds that

$$T_F(\lambda) = \left| F(e^{-i\lambda}) \right|^2 = F(e^{i\lambda})\, F(e^{-i\lambda}),$$

where Euler’s formula allows for expressing the complex-valued exponential function by sine and
cosine,

$$e^{i\lambda} = \cos\lambda + i \sin\lambda, \quad i^2 = -1,$$

with the conjugate complex number $e^{-i\lambda} = \cos\lambda - i \sin\lambda$, where $i$ denotes the imaginary unit.
Instead of burdening the reader with complex numbers and functions, we rather expect him or her to
handle the more cumbersome definition from (4.4). By the way, the term “power transfer function”
stems from calling $F(e^{-i\lambda})$ alone transfer function of the filter $F(L)$, and $T_F(\lambda) = |F(e^{-i\lambda})|^2$
being the power thereof.



Proposition 4.2 (Spectra of Filtered Processes) Let $\{e_t\}$ be a stationary process
with spectrum $f_e(\lambda)$. Let the filter

$$F(L) = \sum_{j=0}^{\infty} w_j L^j$$

be absolutely summable, $\sum_{j=0}^{\infty} |w_j| < \infty$, and $\{x_t\}$ be

$$x_t = F(L)\, e_t .$$

Then, $\{x_t\}$ is stationary with spectrum

$$f_x(\lambda) = T_F(\lambda)\, f_e(\lambda), \quad \lambda \in [0, \pi],$$

where $T_F(\lambda)$ is defined in (4.4).

Example 4.5 (Infinite MA) Let $e_t = \varepsilon_t$ from Proposition 4.2 be white noise with

$$f_\varepsilon(\lambda) = \sigma^2/2\pi,$$

and consider an absolutely summable MA($\infty$) process,

$$x_t = C(L)\, \varepsilon_t = \sum_{j=0}^{\infty} c_j \varepsilon_{t-j} .$$

Then Proposition 4.2 applies:

$$f_x(\lambda) = \left( \sum_{j=0}^{\infty} c_j^2 + 2 \sum_{h=1}^{\infty} \sum_{j=0}^{\infty} c_j c_{j+h} \cos(\lambda h) \right) \frac{\sigma^2}{2\pi}, \quad \lambda \in [0, \pi]. \tag{4.5}$$

This special case of Proposition 4.2 will be verified in Problem 4.3. Note that the
spectrum given in (4.5) equals of course the result from Proposition 4.1 with (4.2),
which continues to hold without absolute summability. ■


Persistence 

We now return more systematically to the issue of persistence that we have touched
upon in the example of the AR(1) process in the previous chapter. Loosely speaking,
we understand by persistence the degree of (positive) autocorrelation such that
subsequent observations form clusters: positive observations tend to be followed
by positive ones, while negative observations tend to induce negative ones. With
persistence we try to capture the strength of such a tendency, which depends not
only on the autocorrelation coefficient at lag one but also on higher order lags. In
the previous chapter we mentioned that it has been suggested to measure persistence
by means of the cumulated impulse responses CIR defined in (3.3). This quantity
shows up in the spectrum at frequency zero by Proposition 4.2. Assume that $\{x_t\}$ is
an MA($\infty$) process with absolutely summable impulse response sequence $\{c_j\}$. We
then have:


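$$f_x(0) = \frac{\sigma^2}{2\pi} \left( \sum_{j=0}^{\infty} c_j^2 + 2 \sum_{h=1}^{\infty} \sum_{j=0}^{\infty} c_j c_{j+h} \right) = \frac{\sigma^2}{2\pi} \left( \sum_{j=0}^{\infty} c_j \right)^2,$$

or

$$f_x(0) = (CIR)^2\, \frac{\sigma^2}{2\pi} .$$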

Hence, the larger CIR, the stronger is the contribution of the trend component at
frequency zero to the variance of the process, which formalizes our concept of
persistence. Cogley and Sargent (2005) applied as relative spectral measure for
persistence the ratio $2\pi f_x(0) / \gamma_x(0)$ pioneered previously by Cochrane (1988);
it can be interpreted as a variance ratio and is hence abbreviated as VR:

$$VR := \frac{2\pi f_x(0)}{\gamma_x(0)} . \tag{4.6}$$

In the case of a stationary AR(1) process, $x_t = a_1 x_{t-1} + \varepsilon_t$, it holds that (see
Problem 4.7 with $b_1 = 0$)

$$VR = \frac{1 - a_1^2}{(1 - a_1)^2} = \frac{1 + a_1}{1 - a_1} \ \begin{cases} > 1 & \text{if } a_1 > 0 \\ = 1 & \text{if } a_1 = 0 \\ < 1 & \text{if } a_1 < 0 \end{cases} . \tag{4.7}$$
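
The AR(1) result in (4.7) can be checked numerically from the impulse responses alone. The sketch below is an illustration under stated assumptions: it uses $2\pi f_x(0) = \sigma^2 (\sum_j c_j)^2$ from above and $\gamma_x(0) = \sigma^2 \sum_j c_j^2$ from Proposition 3.2, truncates the AR(1) impulse responses $c_j = a_1^j$ after a fixed number of terms, and uses function names and parameter values that are not from the text.

```python
def variance_ratio_from_impulse_responses(c):
    """VR = 2 pi f(0) / gamma(0) = (sum c_j)^2 / sum c_j^2 for an MA(infinity)
    process with (truncated) impulse responses c; sigma^2 cancels."""
    cir = sum(c)  # cumulated impulse responses
    return cir * cir / sum(cj * cj for cj in c)

def ar1_impulse_responses(a1, n_terms=2000):
    """c_j = a1^j for the stationary AR(1), truncated after n_terms terms."""
    return [a1 ** j for j in range(n_terms)]

results = {}
for a1 in (-0.5, 0.0, 0.5, 0.9):
    vr = variance_ratio_from_impulse_responses(ar1_impulse_responses(a1))
    closed_form = (1.0 + a1) / (1.0 - a1)  # right-hand side of (4.7)
    results[a1] = (vr, closed_form)
    print(a1, vr, closed_form)
```

The truncated sums agree with $(1 + a_1)/(1 - a_1)$ to high accuracy for these values, and the ratio is below, equal to, or above one according to the sign of $a_1$.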


In the case of $a_1 = 0$ (white noise) we have no persistence, and $VR = 1$. For $a_1 > 0$
the process is all the more persistent the larger $a_1$ is. Following Hassler (2014), one
may say that a process has negative persistence if $VR < 1$. The plot of a series
under negative persistence will typically display a zigzag pattern as observed in the
last plot in Fig. 3.2. The limiting cases of $VR = 0$ (also called antipersistent) and
$VR = \infty$ (also called strongly persistent) will be dealt with in Chap. 5.


ARMA Spectra

As a consequence of the previous proposition, we can derive what the spectrum of
a stationary ARMA process $\{x_t\}$ looks like. Remember the definition from (3.13),

$$A(L)\, x_t = B(L)\, \varepsilon_t .$$

Now, define

$$y_t = A(L)\, x_t = B(L)\, \varepsilon_t .$$

By Proposition 4.2 one obtains for the spectra

$$f_y(\lambda) = T_A(\lambda)\, f_x(\lambda) = T_B(\lambda)\, \sigma^2/2\pi .$$

The assumption of a stationary MA($\infty$) representation implies $T_A(\lambda) > 0$.⁷
Consequently, one may solve for $f_x$, rendering the following corollary.

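Corollary 4.1 (ARMA Spectra) Let $\{x_t\}$ be a stationary ARMA($p$, $q$) process

$$A(L)\, x_t = \nu + B(L)\, \varepsilon_t .$$

Its spectrum is given by

$$f(\lambda) = \frac{T_B(\lambda)}{T_A(\lambda)}\, \frac{\sigma^2}{2\pi}, \quad \lambda \in [0, \pi],$$

where $T_B(\lambda)$ and $T_A(\lambda)$ are the power transfer functions of $B(L)$ and $A(L)$.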


Often, we restrict the class of stationary ARMA processes to the invertible
ones, meaning we assume that the moving average polynomial $B(L)$ satisfies the
invertibility condition of Proposition 3.3: All solutions of $B(z) = 0$ are larger than
1 in absolute value. This implies as in Footnote 7 that $T_B(\lambda) > 0$, such that the
invertible ARMA spectrum is strictly positive for all $\lambda$.


7 According to Proposition 3.5 we rule out autoregressive roots on the unit circle, such that
$A(e^{-i\lambda}) \ne 0$, and $|A(e^{-i\lambda})| > 0$. By assumption, $|z| = 1$ implies $A(z) \ne 0$, and here, $z = e^{-i\lambda}$
with

$$|e^{-i\lambda}|^2 = (\cos\lambda)^2 + (\sin\lambda)^2 = 1 .$$

In the next section we will learn that the calculation of the functions $T_A(\lambda)$
and $T_B(\lambda)$ and thereby the calculation of the spectra do not pose any problem, cf.
Eq. (4.8).

4.4  Examples  of  ARMA  Spectra 

The  ARMA  filters  A(L)  and  B(L)  are  assumed  to  be  of  finite  order.  In  order  to 
calculate  the  spectrum,  the  power  transfer  function  is  needed  due  to  Corollary  4.1. 
Thus,  next  we  will  get  to  know  a  simple  trick  allowing  for  quickly  calculating  the 
power  transfer  function  of  a  finite  filter. 


Summation  over  the  Diagonal 

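We consider for finite $p$ the filter $F(L)$ with the coefficients $w_0, w_1, \ldots, w_p$ being
collected in a vector:

$$w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{p-1} \\ w_p \end{pmatrix} .$$

The outer product yields a matrix, where $w'$ stands for the transposition of the
column $w$:

$$w w' = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{p-1} \\ w_p \end{pmatrix} (w_0, w_1, \ldots, w_{p-1}, w_p)
= \begin{pmatrix}
w_0^2 & w_0 w_1 & \cdots & w_0 w_{p-1} & w_0 w_p \\
w_1 w_0 & w_1^2 & \cdots & w_1 w_{p-1} & w_1 w_p \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
w_{p-1} w_0 & w_{p-1} w_1 & \cdots & w_{p-1}^2 & w_{p-1} w_p \\
w_p w_0 & w_p w_1 & \cdots & w_p w_{p-1} & w_p^2
\end{pmatrix} .$$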

Obviously, the matrix is symmetric. Now, we attach the cosine as function of $|j - i|$,
$\cos(\lambda |j - i|)$, as a factor to the entries $w_i w_j$. Let the resulting matrix be called $M_F(\lambda)$. It
becomes:

$$M_F(\lambda) = \begin{pmatrix}
w_0^2 \cos(0) & w_0 w_1 \cos(\lambda) & \cdots & w_0 w_p \cos(\lambda p) \\
w_0 w_1 \cos(\lambda) & w_1^2 \cos(0) & \cdots & w_1 w_p \cos(\lambda (p-1)) \\
\vdots & \vdots & \ddots & \vdots \\
w_0 w_{p-1} \cos(\lambda (p-1)) & w_1 w_{p-1} \cos(\lambda (p-2)) & \cdots & w_{p-1} w_p \cos(\lambda) \\
w_0 w_p \cos(\lambda p) & w_1 w_p \cos(\lambda (p-1)) & \cdots & w_p^2 \cos(0)
\end{pmatrix} .$$

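The rule for calculating $T_F(\lambda)$ reads in words: “Add up the sums over all diagonals
of $M_F(\lambda)$”:

$$\left[ w_0^2 + \ldots + w_p^2 \right] + 2 \left[ w_0 w_1 + \ldots + w_{p-1} w_p \right] \cos(\lambda) + \ldots + 2 \left[ w_0 w_p \right] \cos(\lambda p) .$$

This corresponds exactly to (4.4) for finite $p$:

$$T_F(\lambda) = \sum_{j=0}^{p} w_j^2 + 2 \sum_{h=1}^{p} \left[ \sum_{j=0}^{p-h} w_j w_{j+h} \right] \cos(\lambda h). \tag{4.8}$$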
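
The "summation over the diagonal" rule translates almost literally into code. The following sketch implements (4.8) for a finite coefficient list and cross-checks it against the complex form $|F(e^{-i\lambda})|^2$ from Footnote 6. The function names and the example coefficients are assumptions chosen for illustration.

```python
import cmath
import math

def power_transfer(w, lam):
    """T_F(lam) from (4.8): sum over the diagonals of the outer product w w',
    with the h-th off-diagonal weighted by cos(lam h)."""
    p = len(w) - 1
    total = sum(wj * wj for wj in w)  # main diagonal, cos(0) = 1
    for h in range(1, p + 1):
        diag_sum = sum(w[j] * w[j + h] for j in range(p - h + 1))
        total += 2.0 * diag_sum * math.cos(lam * h)  # two symmetric diagonals
    return total

def power_transfer_complex(w, lam):
    """Cross-check via the squared modulus |F(e^{-i lam})|^2."""
    z = cmath.exp(-1j * lam)
    return abs(sum(wj * z ** j for j, wj in enumerate(w))) ** 2

# Illustrative filter 1 - 0.5 L + 0.3 L^2 (coefficients chosen for this sketch).
w = [1.0, -0.5, 0.3]
for lam in (0.0, 0.7, math.pi / 2, math.pi):
    print(lam, power_transfer(w, lam), power_transfer_complex(w, lam))
```

Both functions return the same values up to rounding; at $\lambda = 0$ the result is simply the squared sum of the coefficients.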


AR(1)  Spectra 

The autoregressive polynomial of order one reads

$$A(L) = 1 - a_1 L,$$

i.e. the filter coefficients are

$$w_0 = 1 \quad \text{and} \quad w_1 = -a_1 .$$

Hence, for the power transfer function, (4.8) provides us with

$$T_A(\lambda) = 1 + a_1^2 - 2 a_1 \cos(\lambda),$$

and Corollary 4.1 yields for the spectrum

$$2\pi f(\lambda) = \frac{\sigma^2}{1 + a_1^2 - 2 a_1 \cos(\lambda)} .$$

In Problem 4.4 we will show that there are extrema at $\lambda = 0$ and $\lambda = \pi$, where
the slope of the spectrum is zero. For $a_1 > 0$ the spectrum decreases on $[0, \pi]$,
i.e. the most significant frequency is $\lambda = 0$: The process is dominated by trending
behavior. Figure 4.4 shows that this holds all the more the greater $a_1$ is: The greater $a_1$,
the steeper and higher the spectrum grows in the area around zero. As a mirror image,
for $a_1 < 0$ it holds that the trend component matters least, see Fig. 4.5. The direct
comparison to the time domain in Fig. 3.2 is also interesting. The case in which
$a_1 > 0$ with the spectral maximum at $\lambda = 0$ translates into persistence of the process:
Observations lying close together in time have similar numerical values, i.e. the
autocorrelation function is positive. For $a_1 < 0$, however, observations following
each other have the tendency to change their sign as, in this case, there is just no
trending behavior.

Fig. 4.4 AR(1) spectra with positive autocorrelation

Fig. 4.5 AR(1) spectra ($2\pi f(\lambda)$), cf. Fig. 3.2


AR(2)  Spectra 

The AR(2) process is given by

$$x_t = \frac{\varepsilon_t}{A(L)} \quad \text{with} \quad A(L) = 1 - a_1 L - a_2 L^2 .$$

In Problem 4.5 we recapitulate the principle of the “summation over the diagonal”
and thus we show

$$T_A(\lambda) = 1 + a_1^2 + a_2^2 + 2 \left[ a_1 (a_2 - 1) \cos(\lambda) - a_2 \cos(2\lambda) \right] .$$

Therefore, due to Corollary 4.1, the corresponding spectrum reads

$$2\pi f(\lambda) = \frac{\sigma^2}{T_A(\lambda)} .$$

For $a_2 = 0$ one obtains the AR(1) case.

In Fig. 4.6 spectra for four parameter constellations are depicted; these are
exactly the four cases for which autocorrelograms are given in Fig. 3.4. The top
left case could be well approximated by an AR(1) process. This is also roughly true
for the top right case; however, closer inspection reveals that the AR(2) spectrum
is not minimal at frequency zero. Both lower spectra break entirely with the AR(1)
pattern. On the bottom right we have the example of the business cycle, see Fig. 4.3.
The spectrum on the bottom left is even more extreme: Except for a rather small
area around $\lambda = 2$, it is close to zero almost everywhere, which is why there is no trend
component. The process is determined by almost only one cycle, which can be seen
in the autocorrelogram as well.
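
To connect the business cycle case with the formula for $T_A(\lambda)$, the sketch below locates the spectral peak of the AR(2) spectrum with $a_1 = 1.0$, $a_2 = -0.5$ (the bottom right constellation of Fig. 4.6) by a simple grid search on $[0, \pi]$ and converts it into a period. The grid size and the function name are assumptions of this illustration, and $\sigma^2 = 1$ as in the figures.

```python
import math

def ar2_spectrum_2pi(lam, a1, a2, sigma2=1.0):
    """2 pi f(lam) = sigma^2 / T_A(lam), with T_A(lam) as given in the text."""
    t_a = (1.0 + a1 * a1 + a2 * a2
           + 2.0 * (a1 * (a2 - 1.0) * math.cos(lam) - a2 * math.cos(2.0 * lam)))
    return sigma2 / t_a

a1, a2 = 1.0, -0.5
n = 100000
grid = [k * math.pi / n for k in range(n + 1)]  # equally spaced on [0, pi]

# Frequency at which the spectrum is largest, and the implied period.
lam_peak = max(grid, key=lambda lam: ar2_spectrum_2pi(lam, a1, a2))
period = 2.0 * math.pi / lam_peak

print(lam_peak, period)
```

The grid search puts the maximum at roughly $\lambda \approx 0.72$, corresponding to a period of between eight and nine years, which is close to the values $0.73$ and $8.6$ quoted in Example 4.4.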


ARMA(1,1)  Spectra 

Now, let us consider the two filters

$$A(L) = 1 - a_1 L \quad \text{and} \quad B(L) = 1 + b_1 L .$$

We know the power transfer function of $B(L)$ from Example 4.3:

$$T_B(\lambda) = 1 + b_1^2 + 2 b_1 \cos(\lambda) .$$

[Figure: four panels on $[0, \pi]$, titled "a1 = 0.7, a2 = 0.1", "a1 = -0.4, a2 = 0.4", "a1 = -1.0, a2 = -0.8" and "a1 = 1.0, a2 = -0.5"]

Fig. 4.6 AR(2) spectra ($2\pi f(\lambda)$), cf. Fig. 3.4


The transformation of $A(L)$ was determined at the beginning of this section. Due to
Corollary 4.1, we put the spectrum together as follows:

$$2\pi f(\lambda) = \frac{1 + b_1^2 + 2 b_1 \cos(\lambda)}{1 + a_1^2 - 2 a_1 \cos(\lambda)}\, \sigma^2, \quad \lambda \in [0, \pi] .$$


To illustrate this, consider the examples from Fig. 4.7. The cases
correspond in their graphical arrangement to the autocorrelograms from Fig. 3.5.
The top right and bottom left cases are interesting. At the top right, the complete
absence of a trend is reflected in an autocorrelogram that is negative and close to zero. At the
bottom left, beside the trend, cycles of higher frequencies add to the process
as well, the process consequently being positively autocorrelated at the first order
and then exhibiting an alternating pattern of autocorrelation.

[Figure: four panels on $[0, \pi]$, titled "a1 = 0.75, b1 = 0.75", "a1 = 0.75, b1 = -0.9", "a1 = -0.5, b1 = 0.9" and "a1 = -0.5, b1 = -0.9"]

Fig. 4.7 ARMA(1,1) spectra ($2\pi f(\lambda)$), cf. Fig. 3.5
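
Corollary 4.1 can be sketched as a small routine for general finite-order ARMA filters. The version below evaluates the power transfer functions through the complex form $|A(e^{-i\lambda})|^2$ and $|B(e^{-i\lambda})|^2$ from Footnote 6 and compares the result with the ARMA(1,1) expression of this section. The function names and the calling convention (lists of $a_k$ and $b_k$) are assumptions made for the illustration.

```python
import cmath
import math

def arma_spectrum_2pi(lam, ar, ma, sigma2=1.0):
    """Return 2 pi f(lam) = sigma^2 T_B(lam) / T_A(lam) for A(L) x_t = B(L) eps_t.

    `ar` holds a_1, ..., a_p of A(L) = 1 - a_1 L - ... - a_p L^p and
    `ma` holds b_1, ..., b_q of B(L) = 1 + b_1 L + ... + b_q L^q.
    """
    z = cmath.exp(-1j * lam)
    a_val = 1.0 - sum(a * z ** (k + 1) for k, a in enumerate(ar))
    b_val = 1.0 + sum(b * z ** (k + 1) for k, b in enumerate(ma))
    return sigma2 * abs(b_val) ** 2 / abs(a_val) ** 2

def arma11_closed_form(lam, a1, b1, sigma2=1.0):
    """ARMA(1,1) spectrum (times 2 pi) as written out in the text."""
    return (sigma2 * (1.0 + b1 * b1 + 2.0 * b1 * math.cos(lam))
            / (1.0 + a1 * a1 - 2.0 * a1 * math.cos(lam)))

# Top right case of Fig. 4.7: a1 = 0.75, b1 = -0.9.
a1, b1 = 0.75, -0.9
at_zero = arma_spectrum_2pi(0.0, [a1], [b1])
at_pi = arma_spectrum_2pi(math.pi, [a1], [b1])
print(at_zero, at_pi)
```

For this constellation the scaled spectrum at frequency zero is $(1 - 0.9)^2/(1 - 0.75)^2 = 0.16$, well below its value at $\pi$, which matches the reading that the trend is essentially absent.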


Multiplicative  Seasonal  AR  Process 

If one wants to have a decaying autocorrelation function not dropping to zero,
then one does not choose a pure MA model as in Example 4.2. The most basic
seasonal autoregressive model is based on the filter $(1 - a_S L^S)$. Frequently, the trend
component is to have an additional weight, which is why one adds the AR(1) factor
$(1 - a_1 L)$:

$$A(L) = (1 - a_1 L)(1 - a_S L^S) = 1 - a_1 L - a_S L^S + a_1 a_S L^{S+1} .$$

Therefore, we have an AR($S+1$) model with parameter restrictions. The spectrum
is adopted from Problem 4.6 in which $T_A(\lambda)$ is given:

$$2\pi f(\lambda) = \frac{\sigma^2}{T_A(\lambda)} .$$

Fig. 4.8 Spectra ($2\pi f(\lambda)$) of multiplicative seasonal AR processes ($S = 4$)


In Fig. 4.8, we show two examples for the quarterly case ($S = 4$). With the
frequencies $\lambda = \pi$ and $\lambda = \pi/2$ the semi-annual cycles with a period of $P = 2$ quarters
and the annual cycles with a length of $P = 4$ quarters are modelled (one also talks about
seasonal cycles). As $a_1 = 0.5$ is positive in both spectra, the trend (at frequency
zero) dominates the seasonal cycles. The annual and semi-annual cycles both add
equally strongly to the variance of the process. However, in the case $a_4 = 0.8$, the
seasonal component is more pronounced than in the case $a_4 = 0.5$, as in the upper
spectrum both seasonal peaks are not only higher than in the lower one (note the
scale on the ordinate) but most of all steeper: In the upper graph, the area beneath the
spectrum substantially concentrates on the three frequencies $0$, $\pi/2$ and $\pi$, whereas
it is more spread over all frequencies in the lower one.


4.5  Problems  and  Solutions 

Problems 

4.1 Prove Proposition 4.1 (e) under the additional assumption of absolute summability.

4.2 Determine the extrema in the spectrum of the seasonal MA process from
Example 4.2.

4.3 Prove the structure of the spectrum (4.5) for absolutely summable MA($\infty$)
processes.

4.4 Determine the extrema of the AR(1) spectrum.

4.5 Determine the power transfer function $T_A(\lambda)$ of the filter $A(L) = 1 - a_1 L - a_2 L^2$.

4.6 Determine the power transfer function $T_A(\lambda)$ of the multiplicative quarterly AR
filter $A(L) = (1 - a_1 L)(1 - a_4 L^4) = 1 - a_1 L - a_4 L^4 + a_1 a_4 L^5$.

4.7 Determine the persistence measure VR from (4.6) for a stationary and invertible
ARMA(1,1) process. Discuss its behavior in particular for the MA(1) model (in
comparison with the AR(1) case given in (4.7)).


Solutions 


4.1 We define the entity $A_h$ and will show that it equals $\gamma(h)$. Due to the symmetry
of the cosine function and of the even spectrum it holds by definition that:

$$A_h := 2 \int_{0}^{\pi} f(\lambda) \cos(\lambda h)\, d\lambda
= \int_{-\pi}^{\pi} f(\lambda) \cos(\lambda h)\, d\lambda
= \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_{l=-\infty}^{\infty} \gamma(l) \cos(\lambda l) \cos(\lambda h)\, d\lambda .$$

Because of the absolute summability, the order of summation and integration is
interchangeable:

$$2\pi A_h = \sum_{l=-\infty}^{\infty} \gamma(l) \int_{-\pi}^{\pi} \cos(\lambda l) \cos(\lambda h)\, d\lambda
= \gamma(0) \int_{-\pi}^{\pi} \cos(\lambda h)\, d\lambda + 2 \sum_{l=1}^{\infty} \gamma(l) \int_{-\pi}^{\pi} \cos(\lambda l) \cos(\lambda h)\, d\lambda .$$

For $h = 0$ it holds that⁸

$$2\pi A_0 = \gamma(0)\, 2\pi + 2 \sum_{l=1}^{\infty} \gamma(l)\, \frac{\sin(\pi l) - \sin(-\pi l)}{l} = 2\pi \gamma(0) + 0 .$$

Accordingly, for $h \ne 0$ it holds that

$$2\pi A_h = 0 + 2 \sum_{l=1}^{\infty} \gamma(l) \int_{-\pi}^{\pi} \frac{\cos(\lambda (l - h)) + \cos(\lambda (l + h))}{2}\, d\lambda,$$

where the trigonometric formula

$$2 \cos x \cos y = \cos(x - y) + \cos(x + y)$$

was used. By this we obtain

$$2\pi A_h = \gamma(h) \int_{-\pi}^{\pi} \left( 1 + \cos(2\lambda h) \right) d\lambda,$$

as one can see that for $k \in \mathbb{Z} \setminus \{0\}$ the integral is

$$\int_{-\pi}^{\pi} \cos(\lambda k)\, d\lambda = \frac{\sin(\pi k) - \sin(-\pi k)}{k} = 0 .$$

So, we finally obtain

$$2\pi A_h = \gamma(h)\, (2\pi + 0) = 2\pi \gamma(h)$$

for $h \ne 0$ as well. Hence, $A_h = \gamma(h)$ for all $h$, and the proof is complete.
4.2 The spectrum

$$f(\lambda) = \left( 1 + b^2 + 2b \cos(\lambda S) \right) \sigma^2/2\pi$$

8 We use

$$\int \cos(\lambda l)\, d\lambda = \frac{\sin(\lambda l)}{l},$$

and $\sin(\pi k) = 0$ for $k \in \mathbb{Z}$.




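is given. In order to determine the extrema, we consider the derivative,

$$f'(\lambda) = -2bS \sin(\lambda S)\, \sigma^2/2\pi,$$

with $(S+1)$ zeros

$$0, \ \frac{\pi}{S}, \ \frac{2\pi}{S}, \ \ldots, \ \frac{(S-1)\pi}{S}, \ \pi$$

on the interval $[0, \pi]$. The sign of the second derivative depends on $b$:

$$f''(\lambda) = -2bS^2 \cos(\lambda S)\, \sigma^2/2\pi .$$

One obtains

$$f''(0) = f''\!\left(\frac{2\pi}{S}\right) = \ldots = -2bS^2 \sigma^2/2\pi,$$

$$f''\!\left(\frac{\pi}{S}\right) = f''\!\left(\frac{3\pi}{S}\right) = \ldots = +2bS^2 \sigma^2/2\pi .$$

Accordingly, maxima and minima follow each other. For $b > 0$, the sequence
of extrema begins with a maximum at zero; for $b < 0$, conversely, it begins
with a minimum.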

4.3 The autocovariances of

$$x_t = \sum_{j=0}^{\infty} c_j \varepsilon_{t-j}$$

are known from Proposition 3.2:

$$\gamma_x(h) = \sigma^2 \sum_{j=0}^{\infty} c_j c_{j+h} .$$

For the spectrum $f_x(\lambda)$ it follows:

$$\frac{2\pi}{\sigma^2}\, f_x(\lambda) = \frac{\gamma_x(0)}{\sigma^2} + 2 \sum_{h=1}^{\infty} \frac{\gamma_x(h)}{\sigma^2} \cos(\lambda h)
= \sum_{j=0}^{\infty} c_j^2 + 2 \sum_{h=1}^{\infty} \sum_{j=0}^{\infty} c_j c_{j+h} \cos(\lambda h) .$$

Hence, the claim is verified.

4.4 For

$$f(\lambda) = \left( 1 + a_1^2 - 2a_1 \cos(\lambda) \right)^{-1} \sigma^2/2\pi$$

one obtains by differentiation

$$f'(\lambda) = -\left( 1 + a_1^2 - 2a_1 \cos(\lambda) \right)^{-2} (-2a_1)(-\sin(\lambda))\, \sigma^2/2\pi .$$

Obviously, candidates for extrema are $\lambda = 0$ and $\lambda = \pi$:

$$f'(0) = f'(\pi) = 0 .$$

The sign of the derivative depends on $a_1$ only:

$$f'(\lambda) < 0, \ \lambda \in (0, \pi) \iff a_1 > 0 .$$

Accordingly,

$$f(0) = (1 - a_1)^{-2} \sigma^2/2\pi \quad \text{and} \quad f(\pi) = (1 + a_1)^{-2} \sigma^2/2\pi$$

are maxima and minima, depending on the sign of $a_1$.

4.5 With the vector of coefficients

$$a = \begin{pmatrix} 1 \\ -a_1 \\ -a_2 \end{pmatrix}$$

we obtain as outer product

$$a a' = \begin{pmatrix} 1 & -a_1 & -a_2 \\ -a_1 & a_1^2 & a_1 a_2 \\ -a_2 & a_1 a_2 & a_2^2 \end{pmatrix} .$$

Adding the cosine, it follows that

$$M_A(\lambda) = \begin{pmatrix} 1 & -a_1 \cos(\lambda) & -a_2 \cos(2\lambda) \\ -a_1 \cos(\lambda) & a_1^2 & a_1 a_2 \cos(\lambda) \\ -a_2 \cos(2\lambda) & a_1 a_2 \cos(\lambda) & a_2^2 \end{pmatrix} .$$

By summation over the diagonal we obtain due to symmetry

$$T_A(\lambda) = 1 + a_1^2 + a_2^2 + 2 \left[ -a_1 + a_1 a_2 \right] \cos(\lambda) + 2 \left[ -a_2 \right] \cos(2\lambda),$$

which results from (4.8) as well. This is in accordance with the result in the text.

4.6 Using (4.8) with $p = 5$ yields the following expression:

$$T_A(\lambda) = 1 + a_1^2 + a_4^2 + a_1^2 a_4^2
+ 2 \left[ -a_1 - a_1 a_4^2 \right] \cos(\lambda) + 2 \left[ a_1 a_4 \right] \cos(3\lambda)
+ 2 \left[ -a_4 - a_1^2 a_4 \right] \cos(4\lambda) + 2 \left[ a_1 a_4 \right] \cos(5\lambda) .$$

This is simply an exercise in concentration and is simplified by the following
equalities:

$$w_0 = 1, \quad w_1 = -a_1, \quad w_2 = w_3 = 0, \quad w_4 = -a_4, \quad w_5 = a_1 a_4 .$$
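
Since Solution 4.6 is mainly bookkeeping, a mechanical cross-check may be helpful. The sketch below multiplies out the two factors of the quarterly filter, applies the diagonal rule (4.8) to the resulting coefficients, and compares with the closed-form expression above. The helper names and the parameter values $a_1 = 0.5$, $a_4 = 0.8$ (one of the cases shown in Fig. 4.8) are choices made for this illustration.

```python
import math

def poly_mul(p, q):
    """Multiply two lag polynomials given as coefficient lists [w0, w1, ...]."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi_coef in enumerate(p):
        for j, qj_coef in enumerate(q):
            out[i + j] += pi_coef * qj_coef
    return out

def power_transfer(w, lam):
    """T_F(lam) by summation over the diagonals, see (4.8)."""
    p = len(w) - 1
    total = sum(wj * wj for wj in w)
    for h in range(1, p + 1):
        total += 2.0 * sum(w[j] * w[j + h] for j in range(p - h + 1)) * math.cos(lam * h)
    return total

def t_a_solution_46(lam, a1, a4):
    """Closed form from Solution 4.6."""
    return (1.0 + a1 ** 2 + a4 ** 2 + a1 ** 2 * a4 ** 2
            + 2.0 * (-a1 - a1 * a4 ** 2) * math.cos(lam)
            + 2.0 * (a1 * a4) * math.cos(3.0 * lam)
            + 2.0 * (-a4 - a1 ** 2 * a4) * math.cos(4.0 * lam)
            + 2.0 * (a1 * a4) * math.cos(5.0 * lam))

a1, a4 = 0.5, 0.8
# (1 - a1 L)(1 - a4 L^4) -> coefficients w0, ..., w5
w = poly_mul([1.0, -a1], [1.0, 0.0, 0.0, 0.0, -a4])
print(w)
print(power_transfer(w, 1.0), t_a_solution_46(1.0, a1, a4))
```

The coefficient list reproduces $w_0, \ldots, w_5$ as stated in the solution, and both ways of computing $T_A(\lambda)$ agree.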


4.7 In the previous section we discussed the ARMA(1,1) process with the
polynomials

$$A(L) = 1 - a_1 L \quad \text{and} \quad B(L) = 1 + b_1 L .$$

Evaluating the spectrum given there we have:

$$2\pi f(0) = \frac{1 + b_1^2 + 2 b_1}{1 + a_1^2 - 2 a_1}\, \sigma^2 = \frac{(1 + b_1)^2}{(1 - a_1)^2}\, \sigma^2 .$$

The variance we copy from Chap. 3:

$$\gamma(0) = \sigma^2\, \frac{1 + b_1^2 + 2 a_1 b_1}{1 - a_1^2} .$$

By (4.6) we obtain

$$VR = \frac{1 + a_1}{1 - a_1}\, \frac{(1 + b_1)^2}{1 + b_1^2 + 2 a_1 b_1} .$$

If  b\  —  0,  the  AR(1)  case  from  (4.7)  is  of  course  reproduced.  If  a\  —  0,  the  MA(1) 
case  results  as 


VR  = 


(l+^i)2 
1  +b2 


>  1  if/?i  >  0 

=  1  if  b\  =  0  . 
<  1  if  b\  <  0 


We  hence  have  negative  persistence  for  b  1  <  0,  which  reflects  the  negative 
autocorrelation.  For  b\  >  0,  it  is  straightforward  to  verify  that  VR  is  growing  with 
b\ ,  reaching  a  maximum  value  of  VR  =  2  for  b\  —  1.  This  corresponds  to  the 
persistence  of  an  AR(1)  process  with  a\  —  1/3.  Hence,  the  invertible  MA(1)  process 
with  \b\\  <  1  can  only  capture  very  moderate  persistence  in  comparison  with  the 
AR(1)  case  where  VR  grows  with  a\  beyond  any  limit. 
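The variance ratio derived above is easily evaluated. The following minimal sketch codes the ARMA(1,1) expression and reproduces the two special cases mentioned; the function name is hypothetical.

```python
def variance_ratio(a1, b1):
    """VR = (1 + a1)/(1 - a1) * (1 + b1)^2 / (1 + b1^2 + 2*a1*b1) for a stationary ARMA(1,1)."""
    if abs(a1) >= 1:
        raise ValueError("stationarity requires |a1| < 1")
    return (1 + a1) / (1 - a1) * (1 + b1) ** 2 / (1 + b1 ** 2 + 2 * a1 * b1)

print(variance_ratio(0.0, 1.0))      # MA(1) at the boundary b1 = 1
print(variance_ratio(1.0 / 3, 0.0))  # AR(1) with a1 = 1/3
print(variance_ratio(0.9, 0.0))      # AR(1) close to the unit root
```

The first two calls both return (up to rounding) the value 2, while the third illustrates that VR grows without bound as $a_1$ approaches 1.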
5 Long Memory and Fractional Integration


5.1  Summary 

Below Proposition 3.5 we saw that the autocorrelation sequence of any stationary ARMA process dies out at exponential rate: $|\rho(h)| \le c\,g^h$, see (3.14). This is too restrictive for many time series of stronger persistence, which display long memory in that the autocovariance sequence vanishes at a slower rate. In some fields of economics and finance long memory is treated as an empirical stylized fact.¹ Fractional integration as a model for long memory will be presented in
this  chapter.  In  the  same  paper  where  Granger  (1981)  introduced  the  Nobel  prize 
winning  concept  of  cointegration  (see  Chap.  16)  he  addressed  the  idea  of  fractional 
integration,  too.  For  an  early  survey  on  fractional  integration  and  applications  see 
Baillie  (1996). 


5.2  Persistence  and  Long  Memory 

We have already briefly touched upon the so-called random walk, see Eq. (1.9). In terms of the difference operator this can be written as $\Delta x_t = \varepsilon_t$, i.e. the process has to be differenced once to obtain stationarity. Alternatively, the process is given by a cumulation or summation over the shocks, $x_t = \sum_{j=1}^{t}\varepsilon_j$, see (1.8), which is the reason to call the process $\{x_t\}$ integrated of order 1, see also Chap. 14. In this section, differencing or integration of order 1 will be extended by introducing non-integer orders of differencing and integration.


¹ See e.g. the special issue edited by Maasoumi and McAleer (2008) in Econometric Reviews on “Realized Volatility and Long Memory”.
Persistence

By persistence we understand how strongly a past shock affects the present of a stochastic process. We stick to the MA(∞) representation behind the Wold decomposition of a stationary process that we briefly touched upon in Chap. 3,

$$x_t = \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j}\,,\qquad \sum_{j=0}^{\infty} c_j^2 < \infty\,,$$

with $\{\varepsilon_t\}$ forming a white noise sequence. The impulse response coefficients $\{c_j\}$ measure the response of $x_t$ to a shock $j$ periods ago. With stationary processes, the shocks are transitory in that $\lim_{j\to\infty} c_j = 0$. In particular, for stationary ARMA processes we know that the impulse responses die out so fast that they are summable in absolute value, see (3.2). To model a stronger degree of persistence and long memory, we require a slower convergence to zero. The model of fractional integration of order $d$ will impose the so-called hyperbolic decay rate,

$$c_j = c\,j^{d-1}\,,\quad c \neq 0\,.$$

Under $d < 1$, the sequence $\{c_j\}$ converges to zero. Clearly, the larger $d$, the stronger is the persistence in that $j^{d-1}$ dies out more slowly. Hence, the parameter $d$ measures the strength of persistence. Contrary to the exponential case characteristic of ARMA processes, hyperbolic decay is so slow for positive $d$ that the impulse responses are not summable. In Problem 5.1 we will establish the following convergence result concerning the so-called (generalized) harmonic series, often also called p-series:

$$\sum_{j=1}^{\infty} j^{-p} < \infty \quad\text{if and only if}\quad p > 1\,. \tag{5.1}$$

Moreover, we will show in Problem 5.2 that exponential decay to zero is faster than hyperbolic decay:

$$\lim_{j\to\infty}\frac{g^j}{j^{d-1}} = 0\,,\quad 0 < g < 1\,,\ |d| < 1\,.$$

In order to illustrate the different decay rates, we display in Figs. 5.1 and 5.2 sequences $j^{d-1}$ and $g^{j-1}$, respectively; by construction they all have the value 1 at $j = 1$.
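The two decay patterns can be tabulated directly. The following minimal sketch evaluates $j^{d-1}$ and $g^{j-1}$ for one illustrative pair of parameters ($d = 0.45$, $g = 0.85$, chosen here from the values used in the figures) and reports their ratio; the function names are hypothetical.

```python
def hyperbolic(j, d):
    """Hyperbolic decay j^(d-1); equals 1 at j = 1."""
    return j ** (d - 1.0)

def exponential(j, g):
    """Exponential decay g^(j-1); equals 1 at j = 1."""
    return g ** (j - 1)

d, g = 0.45, 0.85
for j in (1, 10, 100, 1000):
    ratio = exponential(j, g) / hyperbolic(j, d)
    print(f"j = {j:5d}: hyperbolic = {hyperbolic(j, d):.3e}, "
          f"exponential = {exponential(j, g):.3e}, ratio = {ratio:.3e}")
```

Although both sequences start at 1, the ratio of the exponential to the hyperbolic sequence collapses to zero very quickly as $j$ grows.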
Fig. 5.1  $j^{d-1}$ for d = 0.85, 0.65, 0.45, 0.25


For a process to be stationary, we require the impulse response sequence to be square summable, see Proposition 3.2. From (5.1) we learn that $\sum_{j=1}^{\infty} j^{2d-2}$ is finite if and only if $d < 0.5$, which hence turns out to be the stationarity condition for processes with impulse responses $\{j^{d-1}\}$. The model of fractional integration, however, does not directly assume $c_j = c\,j^{d-1}$; rather this power law will hold true only for large $j$,

$$c_j \sim c\,j^{d-1}\,,\quad j \to \infty\,,$$

where “$\sim$ for $j \to \infty$” is to be understood as a proper limit in the following way:

$$a_j \sim b_j \iff \lim_{j\to\infty}\frac{a_j}{b_j} = 1\,,\quad b_j \neq 0\,. \tag{5.2}$$
Fig. 5.2  $g^{j-1}$ for g = 0.85, 0.65, 0.45, 0.25


Fractional  Differencing  and  Integration 

With the usual difference operator $\Delta = (1 - L)$, see Example 3.2, we define fractional differences by binomial expansion²:

$$\begin{aligned}\Delta^d = (1 - L)^d &= 1 - dL - \frac{d\,(1-d)}{2}\,L^2 - \frac{d\,(1-d)(2-d)}{6}\,L^3 - \cdots \\ &= \sum_{j=0}^{\infty}\pi_j L^j\,,\quad d > -1\,.\end{aligned}$$


² For the rest of this chapter we maintain $d > -1$, which guarantees that $\{\pi_j\}$ converges to 0 with growing $j$, making the infinite expansion meaningful.
Readers not familiar with binomial series for $d \notin \mathbb{N}$ may wish to consult e.g. Trench (2013, Sect. 4.5). The binomial series results from a Taylor expansion of $(1 - z)^d$ about $z = 0$, hence also called Maclaurin series. The coefficients $\{\pi_j\}$ are given in terms of binomial coefficients

$$\binom{d}{j} = \frac{d\,(d-1)\cdots(d-j+1)}{j!}\,,$$

yielding the recursion

$$\pi_j = \binom{d}{j}(-1)^j = \frac{j-1-d}{j}\,\pi_{j-1}\,,\quad j \ge 1\,,\quad \pi_0 = 1\,. \tag{5.3}$$

For natural numbers $d$ one has the more familiar finite expansions,

$$(1 - L)^1 = 1 - L\,,\qquad (1 - L)^2 = 1 - 2L + L^2\,,$$

while the expansion in (5.3) holds for non-integer (or fractional) values of $d$, too. In Problem 5.3 we derive the behavior for large $j$,

$$\pi_j \sim \frac{j^{-d-1}}{\Gamma(-d)}\,,\quad j \to \infty\,,\quad d \neq 0\,, \tag{5.4}$$

where $\Gamma(\cdot)$ is the so-called Gamma function, introduced in greater detail below its definition in (5.18) in the Problem section.
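The recursion (5.3) translates directly into a few lines of code. The sketch below computes the coefficients $\pi_j$, reproduces the finite expansions for $d = 1$ and $d = 2$, and compares $\pi_j$ at a large lag with the asymptotic rate (5.4). The function name and the choice $d = 0.4$ are assumptions for illustration.

```python
import math

def frac_diff_coeffs(d, n):
    """Return [pi_0, ..., pi_n] of (1 - L)^d via pi_j = (j - 1 - d) / j * pi_{j-1}."""
    pi = [1.0]
    for j in range(1, n + 1):
        pi.append((j - 1 - d) / j * pi[-1])
    return pi

print(frac_diff_coeffs(1, 3))    # coefficients of 1 - L, followed by zeros
print(frac_diff_coeffs(2, 3))    # coefficients of 1 - 2L + L^2, followed by zeros

d, n = 0.4, 5000
pi = frac_diff_coeffs(d, n)
asymptotic = n ** (-d - 1) / math.gamma(-d)   # right-hand side of (5.4)
print(pi[n] / asymptotic)                      # ratio close to 1
```

Note that `math.gamma` accepts negative non-integer arguments, which is what (5.4) requires for $0 < d < 1$.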

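Similarly to fractional differences, we may define the fractional integration operator upon inversion,

$$\Delta^{-d} = (1 - L)^{-d} = \sum_{j=0}^{\infty}\psi_j L^j\,,\quad \psi_0 = 1\,,$$

where the coefficients are given by simply replacing $d$ by $-d$ in (5.3):

$$\psi_j = \frac{j-1+d}{j}\,\psi_{j-1}\,,\quad j \ge 1\,. \tag{5.5}$$

The same arguments establishing (5.4) hence show

$$\psi_j \sim \frac{j^{d-1}}{\Gamma(d)}\,,\quad j \to \infty\,,\quad d \neq 0\,. \tag{5.6}$$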
Fractional integration thus imposes the hyperbolic decay rate discussed above, where the speed of convergence varies with $d$. From (5.1) we observe with (5.6) that $\{\psi_j\}$ is summable if and only if $d < 0$, in which case

$$\sum_{j=0}^{\infty}\psi_j = 0\,,\quad d < 0\,. \tag{5.7}$$

Further, $\{\psi_j\}$ is square summable if and only if $d < 0.5$. These are the ingredients to define a fractionally integrated process.
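The summability statements can be illustrated numerically. The following minimal sketch computes the coefficients $\psi_j$ from (5.5) and inspects partial sums for one negative and one positive value of $d$; the truncation point and the values $d = \pm 0.4$ are assumptions chosen for illustration.

```python
import math

def frac_int_coeffs(d, n):
    """Return [psi_0, ..., psi_n] of (1 - L)^(-d) via psi_j = (j - 1 + d) / j * psi_{j-1}."""
    psi = [1.0]
    for j in range(1, n + 1):
        psi.append((j - 1 + d) / j * psi[-1])
    return psi

n = 100_000
psi_neg = frac_int_coeffs(-0.4, n)
psi_pos = frac_int_coeffs(0.4, n)

print(sum(psi_neg))                                       # partial sum, small for d < 0, cf. (5.7)
print(sum(psi_pos))                                       # partial sum, large for d > 0
print(sum(c * c for c in psi_pos))                        # partial sum of squares, stays moderate
print(psi_pos[n] / (n ** (0.4 - 1) / math.gamma(0.4)))    # ratio from (5.6), close to 1
```

The partial sum for $d < 0$ approaches zero only at a hyperbolic rate, so a long truncation is needed before it becomes visibly small.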


5.3  Fractionally  Integrated  Noise 

We  now  apply  the  above  findings  to  the  simplest  case  of  fractional  noise  (which 
is  short  for:  fractionally  integrated  noise),  define  long  memory  in  the  time  domain 
in  (5.9),  and  translate  it  into  the  frequency  domain. 


Fractional  Noise  and  Long  Memory 

In case of fractionally integrated noise the fractional differencing filter $\Delta^d$ has to be applied to $\{x_t\}$ in order to obtain white noise $\{\varepsilon_t\}$ with variance $\sigma^2$: $\Delta^d x_t = \varepsilon_t$. Equivalently, we write after inverting the differences

$$x_t = (1 - L)^{-d}\,\varepsilon_t = \sum_{j=0}^{\infty}\psi_j\,\varepsilon_{t-j}\,,\quad t \in \mathbb{Z}\,,\quad d < 0.5\,, \tag{5.8}$$

with $\{\psi_j\}$ from (5.5) being the sequence of impulse response coefficients measuring the effect of a past shock on the present. As we have discussed above, the impulse responses die out the more slowly the larger $d$ is. In that sense we interpret $d$ as a measure of persistence or memory. The impulse responses die out so slowly that they are not absolutely summable for positive memory parameter $d$. Consequently, for $d > 0$, $\{x_t\}$ from (5.8) does not belong to the class of processes with absolutely summable autocovariances characterized in (3.2), while all stationary ARMA processes belong to this class. Hence, fractional integration models for $d > 0$ a feature that is not captured by traditional ARMA processes, which we
call strong persistence; it is defined in the time domain by

$$\sum_{j=0}^{\infty}\psi_j = \infty\,,\quad d > 0\,.$$

Consequently, CIR or VR defined in (3.3) and (4.6), respectively, do not exist. Even
though the MA coefficients are not summable for positive $d$, the process is still stationary as long as $d < 0.5$ because $\sum_{j}\psi_j^2 < \infty$ due to (5.6) and (5.1). Further, the process is often called invertible for $d > -0.5$ since an autoregressive representation exists that is square summable:

$$\sum_{j=0}^{\infty}\pi_j\,x_{t-j} = \varepsilon_t \quad\text{with}\quad \sum_{j=0}^{\infty}\pi_j^2 < \infty\,.$$

Note that the existence of an autoregressive representation in the mean square sense does not require square summability; in fact, Bondon and Palma (2007) extend the range of invertibility in the mean square to $d > -1$. Given the existence of the MA(∞) representation, the following properties of fractionally integrated noise are proven in the Problem section.³

Proposition 5.1 (Fractional noise, time domain) For fractionally integrated noise from (5.8) it holds with $-1 < d < 0.5$ that

(a) the variance equals

$$\gamma(0) = \gamma(0; d) = \sigma^2\,\frac{\Gamma(1 - 2d)}{(\Gamma(1 - d))^2}\,,$$

with $\Gamma(\cdot)$ being defined in (5.18), and $\gamma(0; d)$ achieves its minimum for $d = 0$;

(b) the autocovariances equal

$$\begin{aligned}\gamma(h) &= \frac{h - 1 + d}{h - d}\,\gamma(h-1)\,,\quad h = 1, 2, \ldots, \\ &\sim \gamma_d\,\sigma^2\,h^{2d-1}\,,\quad h \to \infty\,,\ d \neq 0\,,\end{aligned}$$

with

$$\gamma_d = \frac{\Gamma(1 - 2d)}{\Gamma(d)\,\Gamma(1 - d)}\,,$$

where $\gamma_d < 0$ if and only if $d < 0$;

(c) the autocorrelations $\rho(h) = \rho(h; d)$ grow with $d$ for $d > 0$.
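The statements of Proposition 5.1 lend themselves to a quick numerical check. The following is a minimal sketch that starts from the variance in (a), applies the recursion in (b), and compares a distant autocovariance with the hyperbolic rate; the function names and parameter values are assumptions for illustration.

```python
import math

def fn_autocov(d, max_lag, sigma2=1.0):
    """gamma(0), ..., gamma(max_lag) of fractional noise for -1 < d < 0.5."""
    if not -1.0 < d < 0.5:
        raise ValueError("requires -1 < d < 0.5")
    gam = [sigma2 * math.gamma(1 - 2 * d) / math.gamma(1 - d) ** 2]   # part (a)
    for h in range(1, max_lag + 1):
        gam.append((h - 1 + d) / (h - d) * gam[-1])                   # recursion in part (b)
    return gam

def gamma_d(d):
    """Asymptotic constant Gamma(1-2d) / (Gamma(d) * Gamma(1-d)); not defined for d = 0."""
    return math.gamma(1 - 2 * d) / (math.gamma(d) * math.gamma(1 - d))

for d in (0.45, 0.25, -0.25, -0.45):
    gam = fn_autocov(d, 2000)
    ratio = gam[2000] / (gamma_d(d) * 2000 ** (2 * d - 1))
    print(f"d = {d:+.2f}: gamma(0) = {gam[0]:.4f}, rho(1) = {gam[1] / gam[0]:+.4f}, "
          f"ratio at h = 2000: {ratio:.6f}")
```

In line with the proposition, the variance exceeds $\sigma^2$ for every $d \neq 0$, the first-order autocorrelation has the sign of $d$, and the ratio to the asymptotic rate is close to one.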

Let us briefly comment on those results. First, since $\Gamma(1) = 1$, the minimum variance obtained in the white noise case ($d = 0$) is of course $\gamma(0; 0) = \sigma^2$. Second, from the hyperbolic decay of the autocovariance sequence we observe that $\gamma(h)$ converges to zero with $h$ as long as $d < 0.5$, but for $d > 0$ so slowly that we have long memory defined as

$$\sum_{h=0}^{H}|\gamma(h)| \to \infty\,,\quad H \to \infty\,,\quad\text{if } d > 0\,. \tag{5.9}$$


³ A more technical exposition can be found in Brockwell and Davis (1991, Thm. 13.2.1) or Giraitis, Koul, and Surgailis (2012, Thm. 7.2.1), although they consider only the range $|d| < 0.5$.


In particular, the autocovariances die out the more slowly the larger the memory parameter $d$ is. Obviously, the same feature can be rephrased in terms of autocorrelations. The recursion carries over to the autocorrelations, and Proposition 5.1 (b) yields

$$\rho(h) \sim \frac{\gamma_d\,\sigma^2}{\gamma(0)}\,h^{2d-1}\,,\quad h \to \infty\,.$$

For a numerical and graphical illustration see Fig. 5.3. The asymptotic constant $\gamma_d$ has the same sign as $d$, meaning that in case of long memory the autocovariances converge to zero from above, and vice versa from below zero for $d < 0$, see again Fig. 5.3. Note, however, that $\gamma_d$ collapses to zero as $d \to 0$, simply meaning that the hyperbolic decay rate does not hold for $d = 0$. Third, an effect similar to the one that $d$ has on $\rho(h)$ at long lags holds true for finite $h$. More precisely, Proposition 5.1 (c) says for each finite $h$ that the autocorrelation grows with $d$ (for $d > 0$), which reinforces the interpretation of $d$ as a measure of persistence or long memory.⁴

Fig. 5.3  $\rho(h)$ from Proposition 5.1 for d = 0.45, 0.25, -0.25, -0.45

The case of negative $d$ results in short memory in that the autocovariances are absolutely summable, which is clear again from the p-series in (5.1). This case is sometimes called antipersistent, the reason for that being

$$\sum_{j=0}^{\infty}\psi_j = 0\,,\quad\text{if } d < 0\,.$$

This property translates into a special case of short memory, namely

$$\sum_{h=-\infty}^{\infty}\rho(h) = 0\,,\quad\text{if } d < 0\,,$$

as will become obvious from the spectrum at frequency zero.


Long  Memory  in  the  Frequency  Domain 

It is obvious from the definition of the spectrum in (4.3) that it does not exist at the origin under long memory ($d > 0$), because the autocovariances are not summable. Still, the previous chapter has been set up sufficiently general to cover long memory, see (4.1). Given a singularity at frequency $\lambda = 0$, one still may determine the rate at which $f(\lambda)$ goes off to infinity as $\lambda$ approaches 0. To determine $f$, we have to evaluate the power transfer function of $(1 - L)^{-d}$ from Proposition 4.1 and obtain⁵

$$T_{(1-L)^{-d}}(\lambda) = (2 - 2\cos(\lambda))^{-d} = \left(4\sin^2\!\left(\frac{\lambda}{2}\right)\right)^{-d}\,,$$

where the trigonometric half-angle formula was used for the second equality:

$$2\sin^2(x) = 1 - \cos(2x)\,. \tag{5.10}$$


⁴ More complicated is the effect of changes in $d$ if $d < 0$, see Hassler (2014).

⁵ Readers not familiar with complex numbers, $i^2 = -1$, may skip the following equation, see also Footnote 6 in Chap. 4:

$$\begin{aligned} T_{(1-L)^{-d}}(\lambda) &= (1 - e^{i\lambda})^{-d}\,(1 - e^{-i\lambda})^{-d} \\ &= (1 - e^{i\lambda} - e^{-i\lambda} + 1)^{-d} \\ &= (2 - 2\cos(\lambda))^{-d}\,.\end{aligned}$$


We  hence  have  the  following  result. 


Proposition 5.2 (Fractional noise, frequency domain) Under the assumptions of Proposition 5.1 it holds for the spectrum of fractional noise $x_t = (1 - L)^{-d}\varepsilon_t$ that

$$f(\lambda) = 4^{-d}\,\sin^{-2d}\!\left(\frac{\lambda}{2}\right)\frac{\sigma^2}{2\pi}\,,\quad \lambda > 0\,, \tag{5.11}$$

and

$$f(\lambda) \sim \lambda^{-2d}\,\frac{\sigma^2}{2\pi}\,,\quad \lambda \to 0\,. \tag{5.12}$$

The second statement in Proposition 5.2 is again understood to be asymptotic: Similarly to (5.2) we denote for two functions $a(x)$ and $b(x) \neq 0$:

$$a(x) \sim b(x) \ \text{for } x \to 0 \iff \lim_{x\to 0}\frac{a(x)}{b(x)} = 1\,. \tag{5.13}$$

Since $\lim_{x\to 0}\sin(x)/x = 1$ we write $\sin(x) \sim x$ for $x \to 0$. Consequently, (5.12) arises from (5.11).

From Proposition 5.2 we learn that long memory ($d > 0$) translates into a spectral singularity at frequency zero, and the negative slope is the steeper the larger $d$ is. In other words: the longer the memory, the stronger is the contribution of the long-run trend to the variance of the process. The antipersistent case in contrast is characterized by the opposite extreme: $f(0) = 0$. For an illustration, have a look at Fig. 5.4.
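Both statements of Proposition 5.2 can be evaluated numerically. The sketch below codes the spectrum (5.11) and the approximation (5.12) with $\sigma^2 = 1$ and prints their ratio for frequencies approaching zero; the function names are hypothetical.

```python
import math

def fn_spectrum(lam, d, sigma2=1.0):
    """Spectrum (5.11) of fractional noise for lam > 0."""
    return 4.0 ** (-d) * math.sin(lam / 2.0) ** (-2.0 * d) * sigma2 / (2.0 * math.pi)

def fn_spectrum_origin(lam, d, sigma2=1.0):
    """Approximation (5.12), valid for lam close to zero."""
    return lam ** (-2.0 * d) * sigma2 / (2.0 * math.pi)

for d in (0.45, -0.45):
    for lam in (1.0, 0.1, 0.001):
        ratio = fn_spectrum(lam, d) / fn_spectrum_origin(lam, d)
        print(f"d = {d:+.2f}, lambda = {lam:6.3f}: "
              f"2*pi*f = {2 * math.pi * fn_spectrum(lam, d):10.4f}, ratio = {ratio:.6f}")
```

For $d > 0$ the values explode as the frequency shrinks, for $d < 0$ they vanish, and in both cases the ratio to the approximation tends to one.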


Example 5.1 (Fractionally Integrated Noise) Although the fractional noise is dominated by the trend component at frequency zero (strongly persistent) for $d > 0$, the process is stationary as long as $d < 0.5$. Consequently, a typical trajectory cannot drift off but displays somewhat reversing trends. In Fig. 5.5 we see from simulated data that the deviations from the zero line are stronger for $d = 0.45$ than for $d = 0.25$. The antipersistent series ($d = -0.45$), in contrast, displays an oscillating behavior due to the negative autocorrelation. ■
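Trajectories of this kind can be generated approximately by truncating the MA(∞) representation (5.8). The following is only a sketch under stated assumptions: the shocks are taken to be standard Gaussian, the infinite sum is cut off after `trunc` lags, and the function names are hypothetical. It is not claimed to be the method behind Fig. 5.5.

```python
import random

def psi_weights(d, n):
    """Return [psi_0, ..., psi_n] from the recursion (5.5)."""
    psi = [1.0]
    for j in range(1, n + 1):
        psi.append((j - 1 + d) / j * psi[-1])
    return psi

def simulate_fractional_noise(d, n, trunc=1000, seed=0):
    """Approximate x_t = sum_{j=0}^{trunc} psi_j * eps_{t-j} for t = 1, ..., n.

    Assumptions: Gaussian white noise with unit variance, truncation after `trunc` lags.
    """
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + trunc)]
    psi = psi_weights(d, trunc)
    return [sum(psi[j] * eps[t - j] for j in range(trunc + 1))
            for t in range(trunc, n + trunc)]

for d in (0.45, 0.25, -0.45):
    x = simulate_fractional_noise(d, 200)
    print(f"d = {d:+.2f}: largest absolute deviation from zero = {max(abs(v) for v in x):.3f}")
```

The truncation is a design choice made for simplicity; since the weights decay only hyperbolically for $d > 0$, a larger `trunc` gives a better approximation at higher computational cost.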
Fig. 5.4  $2\pi f(\lambda)$ from Proposition 5.2 for d = 0.45, 0.25, -0.25, -0.45


5.4  Generalizations 

On top of long memory as implied by fractional integration for $0 < d < 0.5$, we now want to allow for additional short memory. We assume that $\Delta^d$ has to be applied to $\{x_t\}$ in order to obtain a short memory process $\{e_t\}$: $\Delta^d x_t = e_t$. At the end of this section, the order of integration $d$ will be extended beyond $d = 0.5$ to cover nonstationary processes, too. Thus we define general fractionally integrated processes of order $d$, in short $x_t \sim I(d)$.⁶


⁶ The use of “$\sim$” with a meaning differing from the one in (5.2) should not be a source of confusion.
Fig. 5.5  Simulated fractional noise for d = 0.45, 0.25, -0.45


Fractionally  Integrated  ARMA  Processes  (ARFIMA) 

Since the papers by Granger and Joyeux (1980) and Hosking (1981), it is often assumed that $\{e_t\}$ is a stationary and invertible ARMA($p$, $q$) process, $A(L)e_t = B(L)\varepsilon_t$, with spectrum

$$f_e(\lambda) = \frac{T_B(\lambda)}{T_A(\lambda)}\,\frac{\sigma^2}{2\pi}\,,$$

see Corollary 4.1. An ARFIMA($p$, $d$, $q$) process is defined by replacing $\varepsilon_t$ in (5.8) by $e_t$, such that

$$A(L)\,\Delta^d x_t = B(L)\,\varepsilon_t\,.$$

With the expansion from (5.5) one obtains

$$x_t = (1 - L)^{-d} e_t = \sum_{j=0}^{\infty}\psi_j\,e_{t-j}\,,\quad t \in \mathbb{Z}\,,\quad d < 0.5\,. \tag{5.14}$$


Under stationarity and invertibility of the ARMA process, the spectrum $f_e$ of $\{e_t\}$ is bounded and bounded away from zero everywhere:

$$0 < f_e(\lambda) < \infty\,,\quad \lambda \in [0, \pi]\,. \tag{5.15}$$

The results from Propositions 5.1 and 5.2 carry over to the ARFIMA($p$, $d$, $q$) process, see also Brockwell and Davis (1991, Sect. 13.2).

Proposition 5.3 (ARFIMA) Let the ARMA process with $A(L)e_t = B(L)\varepsilon_t$ be stationary and invertible. Then the ARFIMA process with $A(L)\Delta^d x_t = B(L)\varepsilon_t$ and $-1 < d < 0.5$ is stationary, and it holds that

(a) the spectral density $f(\lambda)$ is given as

$$\begin{aligned} f(\lambda) &= 4^{-d}\,\sin^{-2d}\!\left(\frac{\lambda}{2}\right) f_e(\lambda)\,,\quad \lambda > 0\,, \\ &\sim \lambda^{-2d}\,f_e(0)\,,\quad \lambda \to 0\,;\end{aligned}$$

(b) the autocovariances satisfy

$$\gamma(h) \sim \gamma_d\,f_e(0)\,2\pi\,h^{2d-1}\,,\quad h \to \infty\,,\ d \neq 0\,,$$

with $\gamma_d$ from Proposition 5.1.

Hosking (1981, Thm. 2) and Brockwell and Davis (1991, Thm. 13.2.2) cover only $|d| < 0.5$, but their proof carries over to $-1 < d \le -0.5$. Further, they state only $\gamma(h) \sim C\,h^{2d-1}$ for some constant $C \neq 0$; looking at the details of the proof of Brockwell and Davis (Thm. 13.2.2), however, it turns out that $C = \gamma_d\,f_e(0)\,2\pi$, which of course covers the case of Proposition 5.1, too. In particular, we find again that long memory defined by a non-summable autocovariance sequence translates into a spectral peak at $\lambda = 0$. This feature occurs for $d > 0$.


Semiparametric  Models 

The parametric assumption that $\{e_t\}$ is an ARMA process is by no means essential for Proposition 5.3 to hold. More generally, we now define a stationary process $\{e_t\}$ to be integrated of order 0, $e_t \sim I(0)$, if

$$e_t = \sum_{k=0}^{\infty} b_k\,\varepsilon_{t-k}\,,\quad\text{with}\quad \sum_{k=0}^{\infty}|b_k| < \infty \quad\text{and}\quad \sum_{k=0}^{\infty} b_k \neq 0\,, \tag{5.16}$$

where $b_0 = 1$ and $\{\varepsilon_t\} \sim WN(0, \sigma^2)$. The absolute summability of $\{b_k\}$ rules out long memory (or $d > 0$) in that the autocovariances of $\{e_t\}$ are absolutely summable, see Proposition 3.2; the second condition that the sequence $\{b_k\}$ does not sum up to zero rules out that $\{e_t\}$ is integrated of order $d$ with $d < 0$, see (5.7). This motivates calling processes from (5.16) integrated of order 0. Consequently, $\{x_t\}$ from (5.14) with $\{e_t\}$ from (5.16) is called integrated of order $d$, $x_t \sim I(d)$. The spectrum of $\{e_t\}$ is of course given by Proposition 4.2, see (4.5). With $f_e$ being the spectrum of the $I(0)$ process, Proposition 5.3 continues to hold without changes, provided $0 < d$ (see Giraitis et al., 2012, Prop. 3.1.1).

A further question is the behavior of the impulse responses of a general $I(d)$ process without parametric model: Does the decay rate $j^{d-1}$ of $\{\psi_j\}$ from $\Delta^{-d}$ carry over? The answer is almost yes, but mild additional assumptions have to be imposed on $\{b_k\}$ from (5.16). Denote

$$x_t = \Delta^{-d} e_t = \sum_{j=0}^{\infty}\psi_j\,e_{t-j} = \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j}\,,$$

where the MA coefficients are given by convolution:

$$c_j = \sum_{k=0}^{j} b_k\,\psi_{j-k}\,,\quad j \ge 0\,.$$

Hassler and Kokoszka (2010) prove that a necessary and sufficient condition for

$$c_j \sim \left(\sum_{k=0}^{\infty} b_k\right)\frac{j^{d-1}}{\Gamma(d)}\,,\quad j \to \infty\,,\quad d > 0\,,$$

is under long memory

$$k^{1-d}\,b_k \to 0\,,\quad k \to \infty\,. \tag{5.17}$$

This is a very weak condition satisfied by all stationary ARMA models and most other processes of practical interest. Hassler (2012) proves that this condition remains necessary in the case of antipersistence, $d < 0$, and establishes a mildly stronger sufficient condition.
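The convolution and the resulting decay rate can be illustrated for a simple case. The sketch below assumes geometric short-memory weights $b_k = 0.5^k$ (the MA(∞) weights of a hypothetical AR(1) process, truncated after 200 terms), which satisfy (5.17), and compares $c_j$ at a large lag with the asymptotic expression; all names and values are assumptions for illustration.

```python
import math

def psi_weights(d, n):
    """Return [psi_0, ..., psi_n] from the recursion (5.5)."""
    psi = [1.0]
    for j in range(1, n + 1):
        psi.append((j - 1 + d) / j * psi[-1])
    return psi

def ma_coefficient(j, d, b):
    """c_j = sum_k b_k * psi_{j-k}, with b_k = 0 beyond the end of the list b."""
    psi = psi_weights(d, j)
    return sum(b[k] * psi[j - k] for k in range(min(j, len(b) - 1) + 1))

d = 0.3
b = [0.5 ** k for k in range(200)]     # hypothetical weights, summing to 2 up to rounding
j = 5000
asymptotic = sum(b) * j ** (d - 1) / math.gamma(d)
print(ma_coefficient(j, d, b) / asymptotic)   # ratio close to 1
```

With `b = [1.0]` the function returns $\psi_j$ itself, which recovers the fractional noise case.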

The statistical literature often refrains from a fractionally integrated model of the type $x_t = (1 - L)^{-d} e_t$, and directly assumes for the corresponding spectral behavior in a vicinity of the origin:

$$f(\lambda) \sim \lambda^{-2d}\,g(\lambda)\,,\quad \lambda \to 0\,.$$

When it comes to estimation of $d$, technical smoothness restrictions are imposed on $g$, including of course the minimum assumption

$$0 < g(0) < \infty\,,$$

which is required to identify $d$.


Nonstationary  Processes 

The simplest way to define a process that is integrated of a higher order than $d < 0.5$ is as follows. Consider a sequence $\{y_t\}$ that has to be differenced once in order to obtain an $I(\delta)$ process, $\Delta y_t = x_t$, $x_t \sim I(\delta)$, or $y_t = y_{t-1} + x_t$. Given a starting value $y_0 = 0$, the solution of this difference equation for $t \in \{1, 2, \ldots, n\}$ is

$$y_t = \sum_{j=1}^{t} x_j\,,\quad t = 1, 2, \ldots, n\,.$$

Since $\{y_t\}$ is given by integration over an $I(\delta)$ process, we say that $\{y_t\}$ is integrated of order $d$, $y_t \sim I(d)$, with $d = \delta + 1$. For $d \ge 0.5$, i.e. $\delta \ge -0.5$, the process $\{y_t\}$ is necessarily nonstationary. We illustrate this type of nonstationarity or drift by means of an example.

Example 5.2 (Nonstationary Fractional Noise) The middle graph in Fig. 5.6 displays a realization of a random walk ($d = 1$). It drifts off from the zero line for very long time spans and crosses it only a few times. The $I(1.45)$ process drifts even more pronouncedly, displaying a much smoother trajectory than the random walk. The $I(0.55)$ process does not drift as strongly, hitting the zero line much more often. In fact, comparing the $I(0.55)$ series with the $I(0.45)$ case from Fig. 5.5, one can imagine that it may be hard to tell apart stationarity and nonstationarity in finite samples. ■

The case of $\delta = 0$ or $y_t \sim I(1)$ is of particular interest in many financial and economic applications. Hence, one may wish to test whether a process is $I(1)$ or not,

$$H_0: d = 1 \quad\text{vs.}\quad H_1: d \neq 1\,.$$

One method to discriminate more specifically between $d = 1$ and $d = 0$ is the celebrated test by Dickey and Fuller (1979), see Chap. 15. In a fractionally integrated framework it is more generally possible to decide e.g. whether a process is nonstationary or not,

$$H_0: d \ge 0.5 \quad\text{vs.}\quad H_1: d < 0.5\,,$$

or whether a process has short memory or not,

$$H_0: d \le 0 \quad\text{vs.}\quad H_1: d > 0\,.$$

Demetrescu, Kuzin, and Hassler (2008) suggested a simple procedure, similar to the Dickey-Fuller test, to test for arbitrary values $d_0$.

Fig. 5.6  Nonstationary fractional noise for d = 1.45, 1.0, 0.55

5.5  Problems  and  Solutions 

Problems 

5.1 Consider the p-series $\sum_{j=1}^{J} j^{-p}$ for $J \to \infty$. Show that the limit is finite if and only if $p > 1$, see (5.1).

5.2 Show that exponential decay is faster than hyperbolic decay, i.e.

$$\lim_{j\to\infty}\frac{g^j}{j^{d-1}} = 0 \quad\text{for } 0 < g < 1\,,\ |d| < 1\,.$$

5.3 Show (5.4).


Hint: Use properties of the Gamma function

$$\Gamma(x) = \begin{cases}\displaystyle\int_0^{\infty} t^{x-1} e^{-t}\,dt\,, & x > 0 \\ \infty\,, & x = 0 \\ \Gamma(x+1)/x\,, & x < 0\,,\ x \neq -1, -2, \ldots\end{cases} \tag{5.18}$$

5.4 Establish the following expression for the autocovariance $\gamma(h)$ of fractionally integrated noise:

$$\gamma(h) = \sigma^2\,\frac{\Gamma(1-2d)\,\Gamma(d+h)}{\Gamma(d)\,\Gamma(1-d)\,\Gamma(1-d+h)}\,. \tag{5.19}$$

Hint: Use Proposition 4.1 with (5.11), and apply the following identity from Gradshteyn and Ryzhik (2000, 3.631.8):

$$\int_0^{\pi}\sin^{\nu-1}(x)\cos(ax)\,dx = \frac{\pi\cos\!\left(\frac{a\pi}{2}\right)\Gamma(\nu+1)}{2^{\nu-1}\,\nu\,\Gamma\!\left(\frac{\nu+a+1}{2}\right)\Gamma\!\left(\frac{\nu-a+1}{2}\right)}\,,\quad\text{where } \nu > 0\,.$$

5.5 Show Proposition 5.1 (a).

Hint: Use (5.19).

5.6 Show Proposition 5.1 (b).

Hint: Use (5.19).

5.7 Show Proposition 5.1 (c).

5.8 Consider the ARFIMA(0, $d$, 1) model $\Delta^d x_t = B(L)\,\varepsilon_t$, $B(L) = 1 + bL$. Show for this special case that the proportionality constant from Proposition 5.3 (b) is indeed $\gamma_d\,f_e(0)\,2\pi$.

Solutions 

5.1 We define the p-series of the first $J$ terms,

$$S_J(p) = \sum_{j=1}^{J} j^{-p}\,.$$

First, we discuss the case separating the convergence region from the divergent one ($p = 1$). For convenience choose $J = 2^n$, such that

$$\begin{aligned} S_{2^n}(1) &= \left(1 + \frac{1}{2}\right) + \left(\frac{1}{3} + \frac{1}{4}\right) + \cdots + \left(\frac{1}{2^{n-1}+1} + \cdots + \frac{1}{2^n}\right) \\ &> \frac{1}{2} + \frac{2}{4} + \frac{4}{8} + \cdots + \frac{2^{n-1}}{2^n} = \frac{n}{2}\,.\end{aligned}$$

Since the lower bound of $S_{2^n}(1)$ diverges to infinity with $n$, $S_J(1)$ must diverge, too.

Second, consider the case where $p < 1$, such that $j^p < j$, or $j^{-p} > j^{-1}$, for $j \ge 2$. Hence, $S_J(p) > S_J(1)$, and divergence of $S_J(1)$ implies divergence for $p < 1$.

Third, for $p > 1$, we group the terms for $J = 2^n - 1$ as follows.

$$\begin{aligned} S_{2^n-1}(p) &= 1 + \left(\frac{1}{2^p} + \frac{1}{3^p}\right) + \cdots + \left(\frac{1}{(2^{n-1})^p} + \cdots + \frac{1}{(2^n-1)^p}\right) \\ &\le 1 + \frac{2}{2^p} + \frac{4}{4^p} + \cdots + \frac{2^{n-1}}{(2^{n-1})^p} \\ &= 1 + \frac{1}{2^{p-1}} + \frac{1}{4^{p-1}} + \cdots + \frac{1}{(2^{n-1})^{p-1}}\,.\end{aligned}$$

We now abbreviate $g = 1/2^{p-1}$ with $0 < g < 1$ since $p > 1$. Consequently,

$$S_{2^n-1}(p) \le 1 + g + g^2 + \cdots + g^{n-1} = \frac{1 - g^n}{1 - g} < \sum_{i=0}^{\infty} g^i = \frac{1}{1-g} = \frac{2^{p-1}}{2^{p-1} - 1}\,,$$

where we use the geometric series, see Problem 3.2. Hence, $S_{2^n-1}(p)$ is bounded for every $n$, while growing monotonically at the same time, which establishes convergence for $p > 1$. Hence, the proof of (5.1) is complete.

We want to add a final remark. While convergence is ensured for $p > 1$, an explicit expression for the limit is by no means obvious, and indeed only known for selected values. For example, for $p = 2$ one has the famous result by Leonhard Euler:

$$\lim_{J\to\infty} S_J(2) = \sum_{j=1}^{\infty}\frac{1}{j^2} = \frac{\pi^2}{6}\,.$$
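The bounds used in this solution can be checked with a few partial sums. The following minimal sketch evaluates $S_J(p)$ by brute force; the function name is hypothetical.

```python
import math

def p_series(p, J):
    """Partial sum S_J(p) = sum_{j=1}^{J} j^(-p)."""
    return sum(j ** (-p) for j in range(1, J + 1))

# p = 1: the partial sum over 2^n terms exceeds the diverging lower bound n/2
for n in (4, 8, 12):
    print(n, p_series(1.0, 2 ** n), n / 2)

# p = 2: the partial sums stay below 2^(p-1) / (2^(p-1) - 1) and approach pi^2 / 6
p = 2.0
bound = 2 ** (p - 1) / (2 ** (p - 1) - 1)
print(p_series(p, 2 ** 12 - 1), bound, math.pi ** 2 / 6)
```

For $p = 2$ the upper bound equals 2, which is not sharp but suffices for convergence; Euler's limit $\pi^2/6$ lies well below it.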
5.2 Since the limit of the ratio of interest is indeterminate, we apply L’Hospital’s rule:

$$\lim_{j\to\infty}\frac{g^j}{j^{d-1}} = \lim_{j\to\infty}\frac{j^{1-d}}{g^{-j}} = \lim_{j\to\infty}\frac{(1-d)\,j^{-d}}{g^{-j}\log(g)(-1)} = \frac{d-1}{\log(g)}\lim_{j\to\infty}\frac{g^j}{j^d} = 0\,,\quad d \ge 0\,.$$

If $-1 < d < 0$, the expression $g^j/j^d$ is still indeterminate. Applying L’Hospital’s rule a second time, however, yields:

$$\lim_{j\to\infty}\frac{g^j}{j^d} = \lim_{j\to\infty}\frac{j^{-d}}{g^{-j}} = \lim_{j\to\infty}\frac{-d\,j^{-d-1}}{g^{-j}\log(g)(-1)} = \frac{d}{\log(g)}\lim_{j\to\infty}\frac{g^j}{j^{1+d}} = 0\,.$$

This establishes the claim.

5.3 Prior to solving the problem, we review some useful properties of the Gamma function that is often employed to simplify manipulations with binomial expressions, see e.g. Sydsæter, Strøm, and Berck (1999, p. 52), and in much greater detail Gradshteyn and Ryzhik (2000, Sect. 8.31), or Rudin (1976, Ch. 8) containing proofs. For integer numbers, $\Gamma$ coincides with the factorial,

$$\Gamma(n+1) = n\,(n-1)\cdots 2 = n!\,,$$

which implies a recursive relation holding in fact in general:

$$\Gamma(x+1) = x\,\Gamma(x)\,. \tag{5.20}$$

Hence, obviously $\Gamma(1) = \Gamma(2) = 1$, and a further value often encountered is $\Gamma(0.5) = \sqrt{\pi}$. The recursive relation further yields the rate of divergence at the origin,

$$\Gamma(x) \sim x^{-1}\,,\quad x \to 0\,,$$

which justifies the convention $\Gamma(0)/\Gamma(0) = 1$. Finally, we want to approximate the Gamma function for large arguments. Remember Stirling’s formula for factorials,

$$n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\,.$$

It generalizes to

$$\Gamma(x) \sim \sqrt{\frac{2\pi}{x}}\left(\frac{x}{e}\right)^x$$

for $x \to \infty$, which again has to be read as

$$\lim_{x\to\infty}\Gamma(x)\Big/\left(\sqrt{\frac{2\pi}{x}}\left(\frac{x}{e}\right)^x\right) = 1\,.$$

Consequently,⁷ we have for finite $x$ and $y$ and large integer $n$ that

$$\frac{\Gamma(n+x)}{\Gamma(n+y)} \sim n^{x-y}\,,\quad n \to \infty\,. \tag{5.21}$$

Now, we turn to establishing (5.4). Repeated application of (5.20) gives

$$\frac{\Gamma(j-d)}{\Gamma(-d)} = (j-d-1)\,(j-d-2)\cdots(-d)\,.$$

By definition of the binomial coefficients we conclude from (5.3) with $\Gamma(j+1) = j!$ that

$$\begin{aligned}\pi_j &= \frac{\Gamma(j-d)}{\Gamma(j+1)\,\Gamma(-d)}\,,\quad j \ge 0\,, \\ &\sim \frac{j^{-d-1}}{\Gamma(-d)}\,,\quad j \to \infty\,,\end{aligned}$$

where the approximation relies on (5.21). This is the required result.
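The approximation (5.21) and the Gamma values quoted above can be verified numerically. The sketch below works with the logarithm of the Gamma function to avoid overflow for large arguments; the function name and the choice $x = 0.3$, $y = 1$ are assumptions for illustration.

```python
import math

def relative_error(n, x, y):
    """Relative deviation of Gamma(n+x)/Gamma(n+y) from n^(x-y), computed on the log scale."""
    log_ratio = math.lgamma(n + x) - math.lgamma(n + y)
    return math.exp(log_ratio - (x - y) * math.log(n)) - 1.0

print(math.gamma(1), math.gamma(2), math.gamma(0.5) ** 2)   # 1, 1 and (approximately) pi
for n in (10, 100, 10_000):
    print(n, relative_error(n, 0.3, 1.0))                    # shrinks toward 0 as n grows
```

The relative error decreases roughly in proportion to $1/n$, in line with the asymptotic statement.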
5.4  With  Proposition  4.1  and  (5.11)  we  compute 


2 


J /(A)  cos(\h)d\ 
o 


— d 


7T_ 

2 


71 


/ 


2crz  /  sin  2J(v)cos(2 hx)dx, 


o 


e 


X 


7Use  (1  +  x/nf 
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n  r\ 

where  we  substituted  j  =  x.  Both  sin  (v)  and  cos (2 hx)  are  even  symmetric  around 
| ,  such  that 


7T 


71 


/ 


sin  2d(x)  cos(2hx)dx  —  2  I  sin  za(x)  cos(2hx)dx 


=2/si] 


-2d, 


0 


0 


Hence, the integration formula 3.631.8 from Gradshteyn and Ryzhik (2000) can be
applied with $\nu = 1 - 2d > 0$ as long as $d < 0.5$ and $a = 2h$:

$$\gamma(h) = \frac{4^{-d}\sigma^2}{\pi}\int_0^{\pi} \sin^{-2d}(x)\cos(2hx)\,dx$$

$$= \frac{4^{-d}\sigma^2}{\pi}\,\frac{\pi\cos(h\pi)\,\Gamma(2-2d)}{2^{-2d}\,(1-2d)\,\Gamma(1-d+h)\,\Gamma(1-d-h)}$$

$$= \sigma^2\,\frac{(-1)^h\,\Gamma(2-2d)}{(1-2d)\,\Gamma(1-d+h)\,\Gamma(1-d-h)}$$

$$= \sigma^2\,\frac{(-1)^h\,\Gamma(1-2d)}{\Gamma(1-d+h)\,\Gamma(1-d-h)}.$$


Using (5.20) once more, one can show that

$$\frac{\Gamma(d+h)}{\Gamma(d)} = (-1)^h\,\frac{\Gamma(1-d)}{\Gamma(1-d-h)}.$$

Therefore, we finally have

$$\gamma(h) = \sigma^2\,\frac{\Gamma(1-2d)\,\Gamma(d+h)}{\Gamma(d)\,\Gamma(1-d)\,\Gamma(1-d+h)},$$

which is (5.19).

5.5 With (5.19) we obtain

$$\gamma(0;d) = \frac{\Gamma(1-2d)}{(\Gamma(1-d))^2},$$

where we assumed $\sigma^2 = 1$ without loss of generality. Instead of the variance, we
will equivalently minimize the natural logarithm thereof. We determine as derivative

$$\frac{\partial \log(\gamma(0;d))}{\partial d} = -2\left(\frac{\Gamma'(1-2d)}{\Gamma(1-2d)} - \frac{\Gamma'(1-d)}{\Gamma(1-d)}\right) = -2\,\big(\psi(1-2d) - \psi(1-d)\big),$$

where the so-called psi function is defined as logarithmic derivative of the Gamma
function,

$$\psi(x) = \frac{\partial \log(\Gamma(x))}{\partial x}.$$

The psi function is strictly increasing for $x > 0$, which can be shown in different
ways, see e.g. Gradshteyn and Ryzhik (2000, 8.362.1). Consequently,

$$\psi(1-d) - \psi(1-2d)\;\begin{cases} > 0, & 0 < d < 0.5 \\ = 0, & d = 0 \\ < 0, & d < 0 \end{cases}$$

which proves that $\log(\gamma(0;d))$, and hence $\gamma(0;d)$, takes on its minimum for $d = 0$.
This solves the problem.

5.6 The recursive relation for $\gamma(h)$ is obvious with (5.19) and (5.20) at hand.

For $h \to \infty$, the approximation in (5.21) yields

$$\gamma(h) \sim \sigma^2\,\frac{\Gamma(1-2d)}{\Gamma(d)\,\Gamma(1-d)}\,h^{2d-1},$$

which defines the constant from Proposition 5.1 (b):

$$\gamma_d = \frac{\Gamma(1-2d)}{\Gamma(d)\,\Gamma(1-d)}.$$

The Gamma function is positive for positive arguments and negative on the interval
$(-1,0)$. Hence, the sign of $\gamma_d$ equals the sign of $\Gamma(d)$ since $d < 0.5$, which
completes the proof of Proposition 5.1 (b).
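The closed form (5.19), the recursion it implies, and the limiting constant can be cross-checked numerically. The sketch below is an illustration only; the function names `acov` and `gamma_d` are introduced here, and the code is restricted to 0 < d < 0.5 so that all Gamma arguments are positive and the log-Gamma function can be used safely.

```python
import math

def acov(h: int, d: float, sigma2: float = 1.0) -> float:
    """Autocovariance gamma(h) of fractional noise from (5.19), for 0 < d < 0.5 and h >= 0."""
    log_value = (math.lgamma(1 - 2 * d) + math.lgamma(d + h)
                 - math.lgamma(d) - math.lgamma(1 - d) - math.lgamma(1 - d + h))
    return sigma2 * math.exp(log_value)

def gamma_d(d: float) -> float:
    """Limiting constant Gamma(1 - 2d) / (Gamma(d) Gamma(1 - d)), for 0 < d < 0.5."""
    return math.exp(math.lgamma(1 - 2 * d) - math.lgamma(d) - math.lgamma(1 - d))

d = 0.3  # illustrative value
# Recursion implied by (5.19) and (5.20): gamma(h) / gamma(h-1) = (h - 1 + d) / (h - d).
print(acov(5, d) / acov(4, d), (5 - 1 + d) / (5 - d))
# Hyperbolic decay: gamma(h) / (gamma_d * h^(2d-1)) should be close to 1 for large h.
print(acov(10_000, d) / (gamma_d(d) * 10_000 ** (2 * d - 1)))
```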

5.7 In terms of autocorrelations the recursion from Proposition 5.1 (b) becomes

$$\rho(h;d) = f(h;d)\,\rho(h-1;d), \quad h \ge 1,$$

where the factor $f(h;d)$ is positive for $d > 0$,

$$f(h;d) = \frac{h-1+d}{h-d} > 0 \quad\text{with}\quad \frac{\partial f(h;d)}{\partial d} > 0,$$

such that $\rho(h;d) > 0$ since $\rho(0;d) = 1$. Hence, we have

$$\frac{\partial \rho(h;d)}{\partial d} = \frac{\partial f(h;d)}{\partial d}\,\rho(h-1;d) + f(h;d)\,\frac{\partial \rho(h-1;d)}{\partial d},$$

which is positive by induction since

$$\frac{\partial \rho(1;d)}{\partial d} = \frac{\partial f(1;d)}{\partial d} > 0.$$

Hence, $\rho(h;d)$ is growing with $d$ as required.

5.8 The spectral density of the MA(1) component $e_t = \varepsilon_t + b\,\varepsilon_{t-1}$ is known from
Example 4.3:

$$f_e(0) = (1+b)^2\,\sigma^2/(2\pi).$$

We now express $x_t$ in terms of a fractional noise called $y_t = \Delta^{-d}\varepsilon_t$:

$$x_t = \Delta^{-d}B(L)\,\varepsilon_t = B(L)\,\Delta^{-d}\varepsilon_t = (1 + bL)\,y_t.$$

Let $\gamma_x(h)$ and $\gamma_y(h)$ denote the autocovariances of $\{x_t\}$ and $\{y_t\}$, respectively. It holds
that

$$\gamma_x(h) = \mathrm{E}\big[(y_t + b\,y_{t-1})\,(y_{t+h} + b\,y_{t+h-1})\big] = (1+b^2)\,\gamma_y(h) + b\,\big(\gamma_y(h-1) + \gamma_y(h+1)\big).$$

With the behavior of $\gamma_y(h)$ from Proposition 5.1 it follows

$$\frac{\gamma_x(h)}{h^{2d-1}} \to (1+b^2)\,\gamma_d\,\sigma^2 + 2b\,\gamma_d\,\sigma^2 = \gamma_d\,f_e(0)\,2\pi,$$

as required.

6 Processes with Autoregressive Conditional Heteroskedasticity (ARCH)


6.1  Summary 

In  particular  in  the  case  of  financial  time  series  one  often  observes  a  highly 
fluctuating  volatility  (or  variance)  of  a  series:  Agitated  periods  with  extreme 
amplitudes  alternate  with  rather  quiet  periods  being  characterized  by  moderate 
observations.  After  some  short  preliminary  considerations  concerning  models  with 
time-dependent  heteroskedasticity,  we  will  discuss  the  model  of  autoregressive 
conditional  heteroskedasticity  (ARCH),  for  which  Robert  F.  Engle  was  awarded 
the  Nobel  prize  in  the  year  2003.  After  a  generalization  (GARCH),  there  will 
be  a  discussion  on  extensions  relevant  for  practice.  Throughout  this  chapter, 
the innovations or shocks $\{\varepsilon_t\}$ stand for a pure random process as defined in
Example 2.7.


6.2  Time-Dependent  Heteroskedasticity 

The heteroskedasticity allowed for here is modeled as time-dependent volatility by¹

$$x_t = \sigma_t\,\varepsilon_t, \quad \varepsilon_t \sim \mathrm{iid}(0,1), \qquad (6.1)$$

where $\{\sigma_t\}$ is the volatility process being stochastically independent of $\{\varepsilon_t\}$ by
assumption, and $\{\varepsilon_t\}$ is a pure random process with unit variance. There exist two
routes to model the volatility process. The first one, often labeled as processes
with stochastic volatility, assumes an unobserved, or latent, process $\{h_t\}$ behind the
volatility: $\sigma_t = \exp(h_t/2)$. This implies for the squared data

$$x_t^2 = e^{h_t}\,\varepsilon_t^2 \quad\text{or}\quad \log(x_t^2) = h_t + \log(\varepsilon_t^2).$$

For an early survey on stochastic volatility processes see Taylor (1994). A second
strand in the literature assumes that $\sigma_t$ depends on observed data, in particular
on past observations $x_{t-j}$. This class of models has been called autoregressive
conditional heteroskedasticity (ARCH). ARCH processes are widely used and
successful in practice and will be the focus of attention in the present chapter.

¹ The following equation could be extended by a mean function, e.g. of a regression-type,

$$x_t = \alpha + \beta z_t + \sigma_t\,\varepsilon_t,$$

or

$$x_t = a_1 x_{t-1} + \cdots + a_p x_{t-p} + \sigma_t\,\varepsilon_t.$$

We restrict our exposition and concentrate on modeling volatility exclusively, although in practice
time-dependent heteroskedasticity is often found with regression errors.


Heteroskedasticity  as  a  Function  of  the  Past 

In this chapter the variance function is modeled by the observed past of the process
itself:

$$\sigma_t^2 = f(x_{t-1}, x_{t-2}, \ldots). \qquad (6.2)$$

By plugging in $x_{t-j}$ from (6.1) one obtains:

$$\sigma_t^2 = f(\sigma_{t-1}\varepsilon_{t-1}, \sigma_{t-2}\varepsilon_{t-2}, \ldots).$$

We will show that the process from (6.1) is a martingale difference sequence.
Remember the definition of the information set $\mathcal{I}_{t-1}$ generated by the past of $x_t$
up to $x_{t-1}$. Then it holds that²

$$\mathrm{E}(x_t \mid \mathcal{I}_{t-1}) = \mathrm{E}(x_t \mid x_{t-1}, x_{t-2}, \ldots)$$

$$= \mathrm{E}(\sigma_t\,\varepsilon_t \mid x_{t-1}, x_{t-2}, \ldots)$$

$$= \sigma_t\,\mathrm{E}(\varepsilon_t \mid x_{t-1}, x_{t-2}, \ldots)$$

$$= \sigma_t\,\mathrm{E}(\varepsilon_t),$$

as $\varepsilon_t$ is independent of $x_{t-j}$ for $j > 0$ by construction. With $\varepsilon_t$ being zero on average,
it follows that

$$\mathrm{E}(x_t \mid \mathcal{I}_{t-1}) = 0,$$

which proves (integrability of $\{x_t\}$ assumed) that $\{x_t\}$ is in fact a martingale
difference sequence, see below (2.11). The variance of the martingale difference
is determined in the following way (as $x_t$ is zero on average):

$$\mathrm{Var}(x_t) = \mathrm{E}(x_t^2) = \mathrm{E}(\sigma_t^2\,\varepsilon_t^2) = \mathrm{E}(\sigma_t^2)\,\mathrm{E}(\varepsilon_t^2) = \mathrm{E}(\sigma_t^2),$$

as $\sigma_t^2$ from (6.2) and $\varepsilon_t$ (with variance 1 from (6.1)) are stochastically independent.
Hence, the following proposition is verified.

² When conditioning on $\mathcal{I}_{t-1}$, one often writes $\mathrm{E}(\cdot \mid x_{t-1}, x_{t-2}, \ldots)$ instead of $\mathrm{E}(\cdot \mid \mathcal{I}_{t-1})$.

Proposition 6.1 (Heteroskedastic Martingale Differences) Let $\{x_t\}$ be from (6.1)
and $\{\sigma_t^2\}$ from (6.2) with $\mathrm{E}(\sigma_t^2) < \infty$ independent of $\{\varepsilon_t\}$. Then $\{x_t\}$ is a martingale
difference sequence with variance

$$\mathrm{Var}(x_t) = \mathrm{E}(x_t^2) = \mathrm{E}(\sigma_t^2).$$

Let us remember Proposition 2.2. Due to the martingale difference property it
holds that

$$\mathrm{E}(x_t) = 0 \quad\text{and}\quad \gamma(h) = \mathrm{E}(x_t\,x_{t+h}) = 0, \quad h \ne 0.$$

Hence, the process is serially uncorrelated with expectation zero, which would be
supposed e.g. for returns. However, the process is generally not independent over
time. The (weak) stationarity of $\{x_t\}$ depends on the possibly variable variance; if
the variance $\mathrm{Var}(x_t)$ is constant, then the entire process is stationary.


Heuristics 

Now, the question is how the functional dependence in (6.2) should be specified
and parameterized. Heteroskedasticity as an empirical phenomenon has been known
to observers on financial markets for a long time. Before ARCH models were
introduced, it had been measured by moving a window of width $B$ through the data
and averaging over the squares:

$$s_t^2 = \frac{1}{B}\sum_{i=1}^{B} x_{t-i}^2.$$

For every point in time $t$ one averages over the preceding $B$ values in order to
determine the variance in $t$. In doing so, we do not center $x_{t-i}$ around the arithmetic
mean as we think of returns with $\mathrm{E}(x_t) = 0$ when applying the procedure, cf.
Proposition 6.1. With daily observations (with five trading days a week) one chooses
e.g. $B = 20$, which approximately corresponds to a time window of a month. A first
improvement of the time-dependent volatility measurement is obtained by using a
weighted average where the weights, $g_i \ge 0$, are not negative:

$$s_t^2 = \sum_{i=1}^{B} g_i\,x_{t-i}^2 \quad\text{with}\quad \sum_{i=1}^{B} g_i = 1. \qquad (6.3)$$


Example 6.1 (Exponential Smoothing) For the weighting function $g_i$ one often uses
an exponential decay:

$$g_i = \frac{\lambda^{i-1}}{1 + \lambda + \ldots + \lambda^{B-1}} \quad\text{with}\quad 0 < \lambda < 1.$$

Note that the denominator is just defined such that it holds that $\sum_{i=1}^{B} g_i = 1$. With
growing $B$ one furthermore obtains

$$1 + \lambda + \ldots + \lambda^{B-1} = \frac{1-\lambda^{B}}{1-\lambda} \to \frac{1}{1-\lambda}.$$

Inserting the exponentially decaying weights in (6.3), we get the following result
for $B \to \infty$:

$$s_t^2(\lambda) = (1-\lambda)\sum_{i=1}^{\infty} \lambda^{i-1}\,x_{t-i}^2.$$

Now it is an easy exercise to verify the following recursive relation:

$$s_t^2(\lambda) = (1-\lambda)\,x_{t-1}^2 + \lambda\,s_{t-1}^2(\lambda). \qquad (6.4)$$

We will call $s_t^2(\lambda)$ the exponentially smoothed volatility or variance. In order to be
able to calculate it for $t = 2, \ldots, n$, we need a starting value. Typically, $s_1^2(\lambda) = x_1^2$
is chosen which leads to $s_2^2(\lambda) = x_1^2$. ■
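The recursion (6.4) with the starting value from Example 6.1 translates directly into code. The following is a minimal sketch for illustration; the function name `smoothed_variance` and the sample data are assumptions made here, not part of the text.

```python
def smoothed_variance(x, lam):
    """Return [s_1^2, ..., s_n^2] from recursion (6.4), started at s_1^2 = x_1^2.

    x   : list of observations x_1, ..., x_n (list index k holds x_{k+1})
    lam : smoothing parameter lambda with 0 < lam < 1
    """
    s2 = [x[0] ** 2]  # starting value s_1^2(lam) = x_1^2
    for k in range(1, len(x)):
        # s_t^2 = (1 - lam) * x_{t-1}^2 + lam * s_{t-1}^2, with t = k + 1
        s2.append((1 - lam) * x[k - 1] ** 2 + lam * s2[-1])
    return s2

returns = [1.0, -2.0, 0.5, 3.0]  # made-up data for illustration
print(smoothed_variance(returns, lam=0.9))
```

Note that the second element equals the first, in line with the remark that the starting value leads to $s_2^2(\lambda) = x_1^2$.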

The ARCH and GARCH processes which are subsequently introduced are
models leading to volatility specifications which generalize $s_t^2$ and $s_t^2(\lambda)$ from (6.3)
and (6.4), respectively.


6.3  ARCH  Models 

So-called autoregressive conditional heteroskedasticity models can be traced back
to Engle (1982). We consider the case of the order $q$ and specify $\sigma_t^2$ from (6.2) as
follows:

$$\sigma_t^2 = \alpha_0 + \alpha_1\,x_{t-1}^2 + \ldots + \alpha_q\,x_{t-q}^2, \qquad (6.5)$$

where it is assumed that

$$\alpha_0 > 0 \quad\text{and}\quad \alpha_i \ge 0, \quad i = 1, \ldots, q, \qquad (6.6)$$

in order to guarantee $\sigma_t^2 > 0$. Note that this variance function corresponds to
$s_t^2$ from (6.3). For $\alpha_1 = \cdots = \alpha_q = 0$, however, the case of homoskedasticity is
modeled.


Conditional  Moments 


Given $x_{t-1}, \ldots, x_{t-q}$, one naturally obtains zero for the conditional expectation as
ARCH processes are martingale differences, see Proposition 6.1. For the conditional
variance it holds that

$$\mathrm{Var}(x_t \mid x_{t-1}, \ldots, x_{t-q}) = \mathrm{E}(x_t^2 \mid x_{t-1}, \ldots, x_{t-q}) = \sigma_t^2\,\mathrm{E}(\varepsilon_t^2 \mid x_{t-1}, \ldots, x_{t-q}) = \sigma_t^2,$$

as $\varepsilon_t$ is again independent of $x_{t-j}$ and has a unit variance. Hence, for the variance it
conditionally holds that:

$$\mathrm{Var}(x_t \mid x_{t-1}, \ldots, x_{t-q}) = \alpha_0 + \alpha_1\,x_{t-1}^2 + \ldots + \alpha_q\,x_{t-q}^2,$$

which explains the name of the models: The conditional variance is modeled
autoregressively (where “autoregressive” means in this case: dependent on the past
of the process). Thus, extreme amplitudes in the previous period are followed
by high volatility in the present period resulting in so-called volatility clusters.
If the assumption of normality of the innovations is added, then the conditional
distribution of $x_t$ given the past is normal as well:

$$x_t \mid x_{t-1}, \ldots, x_{t-q} \sim \mathcal{N}\big(0,\; \alpha_0 + \alpha_1\,x_{t-1}^2 + \ldots + \alpha_q\,x_{t-q}^2\big).$$

But, although the original work by Engle (1982) assumed $\varepsilon_t \sim \mathrm{ii}\mathcal{N}(0,1)$, the
assumption of normality is not crucial for ARCH effects.


Stationarity 

By Proposition 6.1 it holds for the variance that

$$\mathrm{Var}(x_t) = \mathrm{E}(\sigma_t^2) = \alpha_0 + \alpha_1\,\mathrm{E}(x_{t-1}^2) + \cdots + \alpha_q\,\mathrm{E}(x_{t-q}^2).$$

If the process is stationary, $\mathrm{Var}(x_t) = \mathrm{Var}(x_{t-j})$, $j = 1, \ldots, q$, then its variance
results as

$$\mathrm{Var}(x_t) = \frac{\alpha_0}{1 - \alpha_1 - \cdots - \alpha_q}.$$

For a positive variance expression, this requires necessarily (due to $\alpha_0 > 0$):

$$1 - \alpha_1 - \ldots - \alpha_q > 0.$$

In fact, this condition is sufficient for stationarity as well. In Problem 6.1 we
therefore show the following result.

Proposition 6.2 (Stationary ARCH)

Let $\{x_t\}$ be from (6.1) and $\{\sigma_t^2\}$ from (6.5) with (6.6). The process is weakly
stationary if and only if it holds that

$$\sum_{j=1}^{q} \alpha_j < 1.$$


Correlation  of  the  Squares 

We define $e_t = x_t^2 - \sigma_t^2$. Due to Proposition 6.1 the expected value is zero: $\mathrm{E}(e_t) = 0$.
Adding $x_t^2$ to both sides of (6.5), one immediately obtains

$$x_t^2 = \alpha_0 + \alpha_1\,x_{t-1}^2 + \ldots + \alpha_q\,x_{t-q}^2 + e_t.$$

From this we learn that an ARCH($q$) process implies an autoregressive structure
for the squares $\{x_t^2\}$. The serial dependence of an ARCH process originates from
the squares of the process. Because of $\alpha_i \ge 0$, $x_t^2$ and $x_{t-i}^2$ are positively correlated,
which again allows one to capture volatility clusters.

In Figs. 6.1 and 6.2 ARCH(1) time series of the length 500 are simulated. For
this purpose pseudo-random numbers are generated as normally distributed and
$\alpha_0 = 1$ is chosen. The effect of $\alpha_1$ is now quite obvious. The larger the value
of this parameter, the more obvious are the volatility clusters. Long periods with
little movement are followed by shorter periods of vehement, extreme amplitudes
which can be negative as well as positive. These volatility clusters become even
more obvious in the respective lower panel of the figures, in which the squared
observations $\{x_t^2\}$ are depicted. Because of the positive autocorrelation of the
squares, small amplitudes tend again to be followed by small ones while extreme
observations appear to follow each other.
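A simulation along these lines can be sketched as follows. This is an illustration under stated assumptions, not the code behind Figs. 6.1 and 6.2: the function names, the seed, the burn-in length, the sample size and the value $\alpha_1 = 0.3$ (chosen here so that higher moments exist and sample statistics settle quickly) are all choices made for this example.

```python
import random

def simulate_arch1(n, alpha0, alpha1, seed=0, burn_in=500):
    """Simulate n observations x_t = sigma_t * eps_t with sigma_t^2 = alpha0 + alpha1 * x_{t-1}^2."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    x_prev = 0.0               # arbitrary start; its effect is discarded with the burn-in
    xs = []
    for t in range(n + burn_in):
        sigma2 = alpha0 + alpha1 * x_prev ** 2
        x_prev = sigma2 ** 0.5 * rng.gauss(0.0, 1.0)  # Gaussian innovation
        if t >= burn_in:
            xs.append(x_prev)
    return xs

def autocorrelation(z, lag):
    """Sample autocorrelation of the list z at the given lag."""
    n = len(z)
    mean = sum(z) / n
    c0 = sum((v - mean) ** 2 for v in z)
    ck = sum((z[i] - mean) * (z[i + lag] - mean) for i in range(n - lag))
    return ck / c0

xs = simulate_arch1(20_000, alpha0=1.0, alpha1=0.3)
squares = [v * v for v in xs]
print("mean of squares:", sum(squares) / len(squares))  # theory: alpha0 / (1 - alpha1)
print("ACF(1) of x_t:  ", autocorrelation(xs, 1))        # theory: zero
print("ACF(1) of x_t^2:", autocorrelation(squares, 1))   # theory: positive
```

The levels should be roughly uncorrelated while the squares are positively autocorrelated, which is the volatility clustering described above.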


Skewness  and  Kurtosis 

In the first section of the chapter on basic concepts from probability theory we have
defined the kurtosis by means of the fourth moment of a random variable and we
have denoted the corresponding coefficient by $\gamma_2$. For $\gamma_2 > 3$ the density function
is more “peaked” than the one of the normal distribution: On the one hand the
values are more concentrated around the expected value, on the other hand there
occur extreme observations in the tail of the distribution with higher probability
(“fat-tailed and highly peaked”). For stationary ARCH processes with Gaussian
innovations ($\varepsilon_t \sim \mathrm{ii}\mathcal{N}(0,1)$) it holds that the kurtosis exceeds 3 (provided it exists
at all):

$$\gamma_2 > 3.$$

The corresponding derivation can be found in Problem 6.2. Due to this excess
kurtosis, ARCH is generally incompatible with the assumption of an unconditional
Gaussian distribution.

Fig. 6.1 ARCH(1) with $\alpha_0 = 1$ and $\alpha_1 = 0.5$

Fig. 6.2 ARCH(1) with $\alpha_0 = 1$ and $\alpha_1 = 0.9$

We define the skewness coefficient $\gamma_1$ similarly to the kurtosis by the third
moment of a standardized random variable. The skewness coefficient of ARCH
models depends on the symmetry of $\varepsilon_t$. If this innovation is symmetric, then it
follows that $\mathrm{E}(\varepsilon_t^3) = 0$. Hence, $\gamma_1 = 0$ follows for the corresponding ARCH process
(due to independence of $\sigma_t$ and $\varepsilon_t$):

$$\mathrm{E}(x_t^3) = \mathrm{E}(\sigma_t^3)\cdot\mathrm{E}(\varepsilon_t^3) = \mathrm{E}(\sigma_t^3)\cdot 0 = 0.$$

Thereby it was only used that $\varepsilon_t$ is symmetrically distributed.

Example 6.2 (ARCH(1)) In particular for a stationary ARCH(1) process with
$\alpha_1^2 < \frac{1}{3}$ and Gaussian innovations $\varepsilon_t$ the kurtosis is finite and it results as (see
Problem 6.3):

$$\gamma_2 = 3\,\frac{1-\alpha_1^2}{1-3\,\alpha_1^2} > 3.$$

For $\alpha_1^2 \ge \frac{1}{3}$ there occur extreme observations with a high probability such that the
kurtosis is no longer constant. Consider a stationary ARCH(1) process ($\alpha_1 < 1$) with
$\alpha_1^2 = 1/3$. Under this condition one has for $\mu_{4,t} = \mathrm{E}(x_t^4)$ with $\mathrm{E}(x_{t-1}^2) = \alpha_0/(1-\alpha_1)$
assuming Gaussianity:

$$\mu_{4,t} = \mathrm{E}(\varepsilon_t^4)\,\mathrm{E}(\sigma_t^4) = 3\,\mathrm{E}\big((\alpha_0 + \alpha_1\,x_{t-1}^2)^2\big)$$

$$= 3\left[\alpha_0^2 + 2\,\alpha_0\,\alpha_1\,\mathrm{E}(x_{t-1}^2) + \alpha_1^2\,\mu_{4,t-1}\right]$$

$$= c + 3\,\alpha_1^2\,\mu_{4,t-1} = c + \mu_{4,t-1},$$

where the constant $c$ is appropriately defined. Continued substitution yields thus

$$\mu_{4,t} = c\,t + \mu_{4,0}.$$

Hence, we observe that the kurtosis grows linearly over time if $3\,\alpha_1^2 = 1$. ■

6.4 Generalizations

Some  extensions  of  the  ARCH  model  having  originated  from  empirical  features  of 
financial  data  will  be  covered  in  the  following. 


GARCH 


In  practice,  with  many  financial  series  it  can  be  observed  that  the  correlation  of 
the  squares  reaches  far  into  the  past.  Therefore,  for  an  adequate  modeling  a  large 
q  is  needed,  i.e.  a  large  number  of  parameters.  A  very  economical  parametrization, 
however,  is  allowed  for  by  the  GARCH  model. 

Generalized ARCH processes of the order $p$ and $q$ were introduced by Bollerslev
(1986) and are defined by their volatility function

$$\sigma_t^2 = \alpha_0 + \alpha_1\,x_{t-1}^2 + \cdots + \alpha_q\,x_{t-q}^2 + \beta_1\,\sigma_{t-1}^2 + \cdots + \beta_p\,\sigma_{t-p}^2. \qquad (6.7)$$

The resulting process is abbreviated as GARCH($p$, $q$). In addition to the parameter
restrictions from (6.6) it is required that

$$\beta_i \ge 0, \quad i = 1, \ldots, p. \qquad (6.8)$$

Jointly, these restrictions are clearly sufficient for $\sigma_t^2 > 0$ but stricter than necessary.


Substantially  weaker  assumptions  were  derived  by  Nelson  and  Cao  (1992). 

We  adopt  the  stationarity  conditions  for  GARCH  models  from  Bollerslev  (1986, 
Theorem  1).  The  resulting  variance  will  be  determined  in  Problem  6.4.  Thus,  we 
obtain  the  following  results. 

Proposition 6.3 (Stationary GARCH)

Let $\{x_t\}$ be from (6.1) and $\{\sigma_t^2\}$ from (6.7) with (6.6) and (6.8). The process is
weakly stationary if and only if

$$\sum_{j=1}^{q} \alpha_j + \sum_{j=1}^{p} \beta_j < 1.$$

Then it holds for the variance that

$$\mathrm{Var}(x_t) = \frac{\alpha_0}{1 - \sum_{j=1}^{q} \alpha_j - \sum_{j=1}^{p} \beta_j}.$$

It can be shown that the stationary GARCH process can be considered as an
ARCH($\infty$) process. Under the conditions from Proposition 6.3 it holds that (see
Problem 6.5):

$$\sigma_t^2 = \gamma_0 + \sum_{i=1}^{\infty} \gamma_i\,x_{t-i}^2. \qquad (6.9)$$


Thus, the GARCH process allows for modeling an infinitely long dependence of the
volatility on the past of the process itself with only $p + q$ parameters although this
dependence decays with time (i.e. $\gamma_i \to 0$ for $i \to \infty$). The fact that GARCH can be
considered as ARCH($\infty$) has the nice consequence that results for stationary ARCH
processes also hold for GARCH models. In particular, GARCH models are again
special cases of processes with volatility (6.2) and therefore examples of martingale
differences, i.e. Proposition 6.1 holds true. If we assume a Gaussian distribution of
$\{\varepsilon_t\}$, it follows, just as for the ARCH($q$) process of finite order, that the skewness is
zero and that the kurtosis exceeds the value 3.

Example 6.3 (GARCH(1,1)) Consider the GARCH(1,1) case more explicitly. It
is by far the most frequently used GARCH specification in practice. Continued
substitution shows under the assumption of stationarity that ($\alpha_1 + \beta_1 < 1$):

$$\sigma_t^2 = \frac{\alpha_0}{1-\beta_1} + \alpha_1\sum_{i=1}^{\infty} \beta_1^{\,i-1}\,x_{t-i}^2.$$

Hence, we have an explicit ARCH($\infty$) representation of GARCH(1,1). Assuming
that

$$1 - (\alpha_1 + \beta_1)^2 - 2\,\alpha_1^2 > 0,$$

the kurtosis is defined. In Problem 6.6 we show (with Gaussian distribution of $\{\varepsilon_t\}$):

$$\gamma_2 = 3\,\frac{1 - (\alpha_1 + \beta_1)^2}{1 - (\alpha_1 + \beta_1)^2 - 2\,\alpha_1^2}.$$

Furthermore one shows by

$$\sigma_t^2 = \alpha_0 + \alpha_1\,x_{t-1}^2 + \beta_1\,\sigma_{t-1}^2,$$

$$x_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\,x_{t-1}^2 - \beta_1\,(x_{t-1}^2 - \sigma_{t-1}^2) + x_t^2 - \sigma_t^2,$$

the equation

$$x_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\,x_{t-1}^2 + e_t - \beta_1\,e_{t-1}$$

with

$$e_t = x_t^2 - \sigma_t^2, \quad \mathrm{E}(e_t) = 0.$$

The GARCH(1,1) process $\{x_t\}$ therefore corresponds to an ARMA(1,1) structure of
the squares $\{x_t^2\}$. ■

In Figs. 6.3 and 6.4 the influence of the sum of the parameters $\alpha_1 + \beta_1$ is
illustrated by means of simulated GARCH(1,1) observations. We therefore fix
$\alpha_0 = 1$ and $\alpha_1 = 0.3$ and vary $\beta_1$ in such a way that stationarity is ensured.
The larger $\beta_1$ (and therefore the sum $\alpha_1 + \beta_1$), the more pronounced is the
change from quiet periods with little or, in absolute value, moderate amplitudes to
excited periods in which extreme amplitudes follow each other. Again, this pattern
of volatility becomes particularly apparent with the serially correlated squares in the
lower panel, respectively.
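A GARCH(1,1) path of this kind can be generated with a few lines of code, and the ARMA(1,1) representation of the squares can be verified on the simulated path. The sketch below is illustrative only: the function names, the seed and the choice to start the recursion at the stationary variance are assumptions made here, not details taken from the figures.

```python
import random

def simulate_garch11(n, alpha0, alpha1, beta1, seed=0):
    """Simulate x_t = sigma_t * eps_t with sigma_t^2 = alpha0 + alpha1*x_{t-1}^2 + beta1*sigma_{t-1}^2."""
    rng = random.Random(seed)
    sigma2 = alpha0 / (1 - alpha1 - beta1)  # assumption: start at the stationary variance
    xs, sigma2s = [], []
    for _ in range(n):
        x = sigma2 ** 0.5 * rng.gauss(0.0, 1.0)
        xs.append(x)
        sigma2s.append(sigma2)
        sigma2 = alpha0 + alpha1 * x ** 2 + beta1 * sigma2  # volatility for the next period
    return xs, sigma2s

def arma_identity_errors(xs, sigma2s, alpha0, alpha1, beta1):
    """Return x_t^2 - [alpha0 + (alpha1+beta1)*x_{t-1}^2 + e_t - beta1*e_{t-1}] for t >= 2."""
    e = [x ** 2 - s for x, s in zip(xs, sigma2s)]  # e_t = x_t^2 - sigma_t^2
    return [
        xs[t] ** 2 - (alpha0 + (alpha1 + beta1) * xs[t - 1] ** 2 + e[t] - beta1 * e[t - 1])
        for t in range(1, len(xs))
    ]

# Parameter values as in Fig. 6.4; the sample size matches the length used for the ARCH figures.
xs, sigma2s = simulate_garch11(500, alpha0=1.0, alpha1=0.3, beta1=0.5)
errors = arma_identity_errors(xs, sigma2s, 1.0, 0.3, 0.5)
print(max(abs(v) for v in errors))  # zero up to floating-point rounding
```

The identity holds exactly on any path, so the printed value reflects rounding error only.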


IGARCH 

Considering the volatility of GARCH(1,1),

$$\sigma_t^2 = \alpha_0 + \alpha_1\,x_{t-1}^2 + \beta_1\,\sigma_{t-1}^2,$$

we are reminded of $s_t^2(\lambda)$ from (6.4). The difference is that in (6.4) $\alpha_0 = 0$, and it holds
that $\alpha_1 + \beta_1 = 1$ (i.e. $\alpha_1 = 1 - \lambda$ and $\beta_1 = \lambda$, respectively). Models with such a
restriction violate the stationarity condition ($\alpha_1 + \beta_1 < 1$). This can be shown when
forming the expected value of $\sigma_t^2$ with $x_{t-1}^2 = \sigma_{t-1}^2\,\varepsilon_{t-1}^2$:

$$\mathrm{E}(\sigma_t^2) = \alpha_0 + \alpha_1\,\mathrm{E}(\sigma_{t-1}^2)\,\mathrm{E}(\varepsilon_{t-1}^2) + \beta_1\,\mathrm{E}(\sigma_{t-1}^2) = \alpha_0 + (\alpha_1 + \beta_1)\,\mathrm{E}(\sigma_{t-1}^2).$$

With $\alpha_1 + \beta_1 = 1$ one obtains

$$\mathrm{E}(\sigma_t^2 - \sigma_{t-1}^2) = \alpha_0 > 0.$$

In other words: The expectations for the increments of the volatility are positive for
every point in time, modeling a volatility expectation which tends to infinity with $t$.
This idea was generalized in the literature. With

$$\sum_{j=1}^{q} \alpha_j + \sum_{j=1}^{p} \beta_j = 1$$

one talks about integrated GARCH processes (IGARCH) since Engle and Bollerslev
(1986). This is a naming which becomes more understandable in the chapter on
integrated processes.

Fig. 6.3 GARCH(1,1) with $\alpha_0 = 1$, $\alpha_1 = 0.3$ and $\beta_1 = 0.3$

Fig. 6.4 GARCH(1,1) with $\alpha_0 = 1$, $\alpha_1 = 0.3$ and $\beta_1 = 0.5$
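The difference between the stationary and the integrated case can be made concrete by iterating the recursion for the expected volatility, $\mathrm{E}(\sigma_t^2) = \alpha_0 + (\alpha_1 + \beta_1)\,\mathrm{E}(\sigma_{t-1}^2)$. The following sketch is an illustration; the function name and the starting value are hypothetical choices.

```python
def expected_variance_path(alpha0, persistence, start, steps):
    """Iterate E(sigma_t^2) = alpha0 + persistence * E(sigma_{t-1}^2), persistence = alpha1 + beta1."""
    path = [start]
    for _ in range(steps):
        path.append(alpha0 + persistence * path[-1])
    return path

# Stationary case: alpha1 + beta1 = 0.8 < 1, converges to alpha0 / (1 - 0.8).
print(expected_variance_path(1.0, 0.8, start=2.0, steps=200)[-1])
# Integrated case: alpha1 + beta1 = 1, grows by alpha0 in every step.
print(expected_variance_path(1.0, 1.0, start=2.0, steps=10))
```

With persistence below one the path settles at the stationary variance, whereas with persistence equal to one it increases linearly without bound.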


Fig. 6.5 IGARCH(1,1) with $\alpha_0 = 1$, $\alpha_1 = 0.3$ and $\beta_1 = 0.7$

In Fig. 6.5, an IGARCH(1,1) process ($\alpha_1 + \beta_1 = 1$) was simulated according
to the scheme from Figs. 6.3 and 6.4. In comparison to the previous figures, in this
case we find considerably more extreme volatility clusters which, however, are not
exaggerated. The kind of dynamics depicted in Fig. 6.5 can be frequently observed
in financial practice.


GARCH-M 

We talk about “GARCH in mean”³ (GARCH-M) if the volatility term influences the
(mean) level of the process. In order to explain this with regard to contents, we think
of risk premia: For a high volatility of an investment (high-risk), a higher return is
expected, on average. In equation form we write this down as follows:

$$x_t = \theta\,\sigma_t + u_t, \qquad (6.10)$$

where $\{u_t\}$ is a GARCH process:

$$u_t = \sigma_t\,\varepsilon_t, \quad \sigma_t^2 = \alpha_0 + \alpha_1\,u_{t-1}^2 + \cdots + \alpha_q\,u_{t-q}^2 + \beta_1\,\sigma_{t-1}^2 + \ldots + \beta_p\,\sigma_{t-p}^2. \qquad (6.11)$$

³ Originally, the ARCH-M model was proposed by Engle, Lilien, and Robins (1987).

Fig. 6.6 GARCH(1,1)-M from (6.11) with $\alpha_0 = 1$, $\alpha_1 = 0.3$ and $\beta_1 = 0.5$ (observations for $\theta = 1$)

Therefore, in this case the mean function $\mu_t$ is set to $\theta\,\sigma_t$. In some applications it has
proved successful to model the risk premium as a multiple of the variance instead
of modeling it by the standard deviation ($\theta\,\sigma_t$):

$$x_t = \theta\,\sigma_t^2 + u_t.$$

For both mean functions the GARCH-M process $\{x_t\}$ is no longer free from serial
correlation for $\theta > 0$; it is no longer a martingale difference sequence.

In Fig. 6.6 a GARCH-M series was generated as in (6.11). The volatility cluster
can be well identified in the lower panel of the squares. The effect of $\theta = 1$ becomes
apparent in the upper panel: In the series of $x_t$ local, reversing trends can be spotted.
Upward trends involve a high volatility, whereas quiet periods are marked by a
decreasing or lower level.


EGARCH 

We talk about exponential GARCH when the volatility is modeled as an exponential
function of the past squares $x_{t-i}^2$. This suggestion originates from Nelson (1991) and
was made in order to capture the asymmetries in the volatility.⁴ It is observed that
decreasing stock prices (negative returns) tend to involve higher volatilities than
increasing ones. This so-called leverage effect is not captured by ordinary GARCH
models.

⁴ We do not exactly present Nelson’s model but a slightly modified implementation which is used
in the software package EViews.

In EViews the variance for EGARCH is calculated as follows:

$$\log\sigma_t^2 = \omega + \sum_{j=1}^{p} \beta_j\,\log\sigma_{t-j}^2 + \sum_{j=1}^{q} \big(\alpha_j\,|\varepsilon_{t-j}| + \gamma_j\,\varepsilon_{t-j}\big),$$

where $\varepsilon_{t-j}$ is defined as

$$\varepsilon_{t-j} = \frac{x_{t-j}}{\sigma_{t-j}}.$$

For $\gamma_j = 0$ the sign is not an issue. However, for $\gamma_j < 0$ the negative case

$$\alpha_j\,|\varepsilon_{t-j}| + \gamma_j\,\varepsilon_{t-j} = (\alpha_j - \gamma_j)\,|\varepsilon_{t-j}| \quad\text{for } \varepsilon_{t-j} < 0$$

has a stronger effect on $\log\sigma_t^2$ than the positive case

$$\alpha_j\,|\varepsilon_{t-j}| + \gamma_j\,\varepsilon_{t-j} = (\alpha_j + \gamma_j)\,\varepsilon_{t-j} \quad\text{for } \varepsilon_{t-j} > 0.$$

Note that for EGARCH the expression $\sigma_t^2$ is always positive by construction,
without any parameter restrictions. Applying the exponential function, it results that

$$\sigma_t^2 = \exp\left(\omega + \sum_{j=1}^{p} \beta_j\,\log\sigma_{t-j}^2 + \sum_{j=1}^{q} \big(\alpha_j\,|\varepsilon_{t-j}| + \gamma_j\,\varepsilon_{t-j}\big)\right).$$

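Example 6.4 (EGARCH(1,1)) Again, as special case we treat the situation with
$p = q = 1$,

$$\log\sigma_t^2 = \omega + \beta_1\,\log\sigma_{t-1}^2 + \alpha_1\,|\varepsilon_{t-1}| + \gamma_1\,\varepsilon_{t-1},$$

or after applying the exponential function:

$$\sigma_t^2 = e^{\omega}\,\sigma_{t-1}^{2\beta_1}\,\exp\big(\alpha_1\,|\varepsilon_{t-1}| + \gamma_1\,\varepsilon_{t-1}\big)
= e^{\omega}\,\sigma_{t-1}^{2\beta_1}\begin{cases} \exp\big(|\varepsilon_{t-1}|\,(\alpha_1 - \gamma_1)\big), & \varepsilon_{t-1} < 0 \\ \exp\big(\varepsilon_{t-1}\,(\alpha_1 + \gamma_1)\big), & \varepsilon_{t-1} \ge 0 \end{cases}$$

In this case it is again shown that for $\gamma_1 < 0$ the leverage effect is modeled in
such a way that negative observations have a larger volatility effect than positive
observations of the same absolute value. ■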
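The asymmetry of the EGARCH(1,1) update is easy to see in a one-step computation. The following is a minimal sketch of the recursion from Example 6.4; the function name and argument names are introduced here for illustration, and the parameter values are those quoted for Fig. 6.7.

```python
import math

def egarch11_next_variance(sigma2_prev, eps_prev, omega, alpha1, beta1, gamma1):
    """One step of log sigma_t^2 = omega + beta1*log sigma_{t-1}^2 + alpha1*|eps_{t-1}| + gamma1*eps_{t-1}."""
    log_sigma2 = (omega + beta1 * math.log(sigma2_prev)
                  + alpha1 * abs(eps_prev) + gamma1 * eps_prev)
    return math.exp(log_sigma2)  # positive by construction, no parameter restrictions needed

params = dict(omega=1.0, alpha1=0.3, beta1=0.5, gamma1=-0.5)
after_negative = egarch11_next_variance(1.0, -1.0, **params)
after_positive = egarch11_next_variance(1.0, +1.0, **params)
# With gamma1 < 0, a negative shock raises next-period variance more than a positive one.
print(after_negative, after_positive, after_negative / after_positive)
```

For shocks of equal absolute value the ratio of the two variances equals $\exp(-2\gamma_1|\varepsilon_{t-1}|)$, which exceeds one whenever $\gamma_1 < 0$.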


In Fig. 6.7 a realization of a simulated EGARCH(1,1) process is depicted. The
“leverage parameter” is $\gamma_1 = -0.5$. When the graphs of the squared and the original
observations are compared, it can be detected that the most extreme amplitudes are
in fact negative. Furthermore, it can be observed that periods with predominantly
negative values are characterized by a high volatility.

Fig. 6.7 EGARCH(1,1) with $\omega = 1$, $\alpha_1 = 0.3$, $\beta_1 = 0.5$ and $\gamma_1 = -0.5$

YAARCH 

The  works  of  Engle  (1982)  and  Bollerslev  (1986)  have  set  the  stage  for  a  downright 
ARCH  industry.  A  large  number  of  generalizations  and  extensions  has  been 
published  and  applied  in  practice.  Most  of  these  versions  were  published  under  more 
or  less  appealing  acronyms.  When  Engle  (2002)  balanced  the  books  after  20  years 
of  ARCH,  he  added  with  some  irony  another  acronym:  YAARCH  standing  for  Yet 
Another  ARCH.  There  is  no  end  in  sight  for  this  literature. 


6.5  Problems  and  Solutions 

Problems 

6.1  Prove  Proposition  6.2. 

Hint: According to Engle (1982, Theorem 2) the process is stationary if and only if
it holds that

$$\alpha(z) = 0 \;\Longrightarrow\; |z| > 1, \qquad (6.12)$$

with $\alpha(z) := 1 - \alpha_1 z - \ldots - \alpha_q z^q$.


6.2 Show that the kurtosis of an ARCH process exceeds the value 3. Assume a
Gaussian distribution of $\{\varepsilon_t\}$ and $\mathrm{E}(\sigma_t^4) < \infty$.

6.3 Calculate the kurtosis of a stationary ARCH(1) process as given in Example 6.2
for the case that it exists. Assume a Gaussian distribution of $\{\varepsilon_t\}$.

6.4 Assume $\{x_t\}$ to be a stationary GARCH process. Determine the variance
expression from Proposition 6.3.

6.5 Assume $\{x_t\}$ to be a stationary GARCH process with (6.6) and (6.8). Determine
the ARCH($\infty$) representation from (6.9).

6.6 Calculate the kurtosis of a stationary GARCH(1,1) process as given in Example 6.3
for the case that it exists. Assume a Gaussian distribution of $\{\varepsilon_t\}$.


Solutions 

6.1 We have to show the equivalence of (6.12) and the condition from Proposition 6.2,
given (6.6). This condition can also be written as $\alpha(1) > 0$. Hence, we
have to prove the equivalence:

$$(6.12) \iff \alpha(1) > 0.$$

We proceed in two steps.

“$\Rightarrow$”: Under the condition by Engle (1982) it holds that

$$\mathrm{Var}(x_t) = \frac{\alpha_0}{1 - \alpha_1 - \ldots - \alpha_q} = \frac{\alpha_0}{\alpha(1)} \ge 0.$$

Due to $\alpha_0 > 0$ it immediately follows that $\alpha(1) \ge 0$. The case $\alpha(1) = 0$, however,
is due to (6.12) excluded, such that $\alpha(1) > 0$ can be concluded.

“$\Leftarrow$”: For a root $z$ of $\alpha(z)$ it holds that:

$$1 = \sum_{j=1}^{q} \alpha_j\,z^j.$$

By the triangle inequality and applying (6.6) it follows that:

$$1 \le \sum_{j=1}^{q} |\alpha_j\,z^j| \le \max_j |z^j| \sum_{j=1}^{q} \alpha_j = \max_j |z^j|\,\big(1 - \alpha(1)\big) < \max_j |z^j|,$$

where the assumption was used for the last inequality. Therefore, we have shown
for a root $z$ of $\alpha(z)$ that it holds that $\max_j |z^j| > 1$ and thus $|z| > 1$.

Hence, the proof is completed.
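For the special case $q = 2$ the equivalence can be illustrated numerically, because the roots of $\alpha(z) = 1 - \alpha_1 z - \alpha_2 z^2$ are available in closed form. The sketch below is an illustration only; the function names and the parameter values are chosen here, and the code assumes $\alpha_1 \ge 0$ and $\alpha_2 > 0$.

```python
import math

def arch2_roots(alpha1, alpha2):
    """Roots of alpha(z) = 1 - alpha1*z - alpha2*z^2, assuming alpha1 >= 0 and alpha2 > 0."""
    # alpha(z) = 0 is equivalent to alpha2*z^2 + alpha1*z - 1 = 0; the discriminant
    # alpha1^2 + 4*alpha2 is positive, so both roots are real.
    root_disc = math.sqrt(alpha1 ** 2 + 4 * alpha2)
    return ((-alpha1 + root_disc) / (2 * alpha2), (-alpha1 - root_disc) / (2 * alpha2))

def roots_outside_unit_circle(alpha1, alpha2):
    """Check condition (6.12) for q = 2."""
    return all(abs(z) > 1 for z in arch2_roots(alpha1, alpha2))

print(arch2_roots(0.3, 0.4), roots_outside_unit_circle(0.3, 0.4))  # alpha1 + alpha2 = 0.7 < 1
print(arch2_roots(0.6, 0.5), roots_outside_unit_circle(0.6, 0.5))  # alpha1 + alpha2 = 1.1 > 1
```

In the first case both roots lie outside the unit circle, in the second one root falls inside it, in line with Proposition 6.2.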

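6.2 From Example 2.4 we adopt due to the Gaussian distribution

$$\mathrm{E}(\varepsilon_t^4) = 3.$$

Therefore, in a first step it follows due to the independence of $\sigma_t$ and $\varepsilon_t$:

$$\mathrm{E}(x_t^4) = \mathrm{E}(\sigma_t^4\,\varepsilon_t^4) = \mathrm{E}(\sigma_t^4)\,\mathrm{E}(\varepsilon_t^4) = 3\,\mathrm{E}(\sigma_t^4).$$

Hence, because of $\mathrm{E}(x_t) = 0$ and Proposition 6.1 the kurtosis of $x_t$ results as:

$$\gamma_2 = \frac{\mathrm{E}(x_t^4)}{(\mathrm{Var}(x_t))^2} = \frac{3\,\mathrm{E}(\sigma_t^4)}{(\mathrm{E}(\sigma_t^2))^2}.$$

The usual variance decomposition, see Eq. (2.1),

$$\mathrm{Var}(\sigma_t^2) = \mathrm{E}(\sigma_t^4) - (\mathrm{E}(\sigma_t^2))^2 > 0,$$

yields

$$\frac{\mathrm{E}(\sigma_t^4)}{(\mathrm{E}(\sigma_t^2))^2} > 1.$$

Hence, the claim is verified: $\gamma_2 > 3$.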

6.3  As  at  and  et  are  stochastically  independent,  it  holds  that 

EC**)  -  E(<r*)E(e*). 


whereby  the  k- th  central  moment  is  given,  as  {xt}  is  a  martingale  difference  sequence 
with  zero  expectation.  On  the  assumption  of  a  standard  normally  distributed  random 
process  one  obtains 


E(e,2)  =  1,  EOf)  =  0  and  E(e4)  =  3. 

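This implies for the ARCH(1) process that the skewness is zero due to $\mathrm{E}(x_t^3) = 0$.

In order to determine the kurtosis, we first observe that the fourth moment is
constant under the condition $3\alpha_1^2 < 1$. To that end define $\mu_{4,t}$ and use
$\mu_2 = \mathrm{Var}(x_t) = \frac{\alpha_0}{1 - \alpha_1}$:

$$\begin{aligned}
\mu_{4,t} = \mathrm{E}(x_t^4) &= \mathrm{E}(\varepsilon_t^4)\, \mathrm{E}(\sigma_t^4) = 3\, \mathrm{E}\big((\alpha_0 + \alpha_1 x_{t-1}^2)^2\big) \\
&= 3 \left(\alpha_0^2 + 2 \alpha_0 \alpha_1 \mu_2\right) + 3 \alpha_1^2\, \mu_{4,t-1} \\
&= c + 3 \alpha_1^2\, \mu_{4,t-1} ,
\end{aligned}$$

where the constant $c$ is defined appropriately. Infinite substitution yields

$$\mu_{4,t} = c \left(1 + 3\alpha_1^2 + (3\alpha_1^2)^2 + \cdots \right) = \frac{c}{1 - 3\alpha_1^2} = \mu_4 .$$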


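With a constant $\mu_4$ (and $\mu_2$) one obtains

$$\begin{aligned}
\mu_4 = \mathrm{E}(x_t^4) = 3\, \mathrm{E}(\sigma_t^4) &= 3\, \mathrm{E}\left(\alpha_0^2 + 2 \alpha_0 \alpha_1 x_{t-1}^2 + \alpha_1^2 x_{t-1}^4\right) \\
&= 3 \left[\alpha_0^2 + 2 \alpha_0 \alpha_1 \mu_2 + \alpha_1^2 \mu_4\right]
\end{aligned}$$

or

$$\mu_4 = \frac{3}{1 - 3\alpha_1^2} \left[\alpha_0^2 + \frac{2 \alpha_0^2 \alpha_1}{1 - \alpha_1}\right] = \frac{3 \alpha_0^2}{1 - 3\alpha_1^2} \, \frac{1 + \alpha_1}{1 - \alpha_1} .$$

From this it follows that

$$\gamma_2 = \frac{\mu_4}{(\mathrm{Var}(x_t))^2} = 3\, \frac{(1 - \alpha_1)(1 + \alpha_1)}{1 - 3\alpha_1^2} = 3\, \frac{1 - \alpha_1^2}{1 - 3\alpha_1^2} .$$

Of course, these transformations were only possible for $1 - 3\alpha_1^2 > 0$. Hence, this is
the condition for a finite, constant kurtosis.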
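The infinite substitution for the fourth moment can be checked numerically. The following Python sketch iterates the recursion $\mu_{4,t} = c + 3\alpha_1^2 \mu_{4,t-1}$ and compares the limit with the closed-form expressions derived above. The function names and the parameter values $\alpha_0 = 1$, $\alpha_1 = 0.5$ are illustrative assumptions and are not taken from the text.

```python
def arch1_fourth_moment(alpha0, alpha1):
    """Closed-form stationary fourth moment of a Gaussian ARCH(1) process."""
    if 3 * alpha1**2 >= 1:
        raise ValueError("a finite fourth moment requires 3 * alpha1**2 < 1")
    return 3 * alpha0**2 * (1 + alpha1) / ((1 - 3 * alpha1**2) * (1 - alpha1))


def arch1_kurtosis(alpha1):
    """Kurtosis 3 (1 - alpha1^2) / (1 - 3 alpha1^2) of a Gaussian ARCH(1) process."""
    if 3 * alpha1**2 >= 1:
        raise ValueError("a finite kurtosis requires 3 * alpha1**2 < 1")
    return 3 * (1 - alpha1**2) / (1 - 3 * alpha1**2)


def iterate_fourth_moment(alpha0, alpha1, start=0.0, steps=200):
    """Iterate mu_{4,t} = c + 3 alpha1^2 mu_{4,t-1}, c = 3 (alpha0^2 + 2 alpha0 alpha1 mu_2)."""
    mu2 = alpha0 / (1 - alpha1)
    c = 3 * (alpha0**2 + 2 * alpha0 * alpha1 * mu2)
    mu4 = start
    for _ in range(steps):
        mu4 = c + 3 * alpha1**2 * mu4
    return mu4


# Hypothetical parameter values with 3 * alpha1**2 = 0.75 < 1.
alpha0, alpha1 = 1.0, 0.5
mu2 = alpha0 / (1 - alpha1)
mu4_limit = iterate_fourth_moment(alpha0, alpha1)

print(mu4_limit, arch1_fourth_moment(alpha0, alpha1))  # recursion limit vs. closed form
print(mu4_limit / mu2**2, arch1_kurtosis(alpha1))      # implied kurtosis vs. formula
```

For values with $3\alpha_1^2 \geq 1$ the geometric series in the recursion no longer converges, which is why both closed-form helpers reject such input.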

6.4 We use the fact that $\sigma_t$ from (6.7) is again independent of $\varepsilon_t$. This can be shown
by substitution of $\sigma_{t-j}^2$ and $x_{t-j}^2$ according to (6.1). Thus, as in Proposition 6.1, for
stationarity and for arbitrary points in time, it holds that:

$$\mathrm{Var}(x_t) = \mathrm{E}(\sigma_t^2) = \gamma(0) .$$




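Hence, by forming the expected value we obtain from (6.7):

$$\gamma(0) = \alpha_0 + \alpha_1 \gamma(0) + \ldots + \alpha_q \gamma(0) + \beta_1 \gamma(0) + \ldots + \beta_p \gamma(0) .$$

Therefore, we can solve

$$\gamma(0) = \frac{\alpha_0}{1 - \sum_{j=1}^{q} \alpha_j - \sum_{j=1}^{p} \beta_j} ,$$

as claimed.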

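6.5 We define the lag polynomial $\beta(L)$ with

$$\beta(L) = 1 - \beta_1 L - \ldots - \beta_p L^p .$$

Hence, it holds that

$$\beta(L)\, \sigma_t^2 = \alpha_0 + \alpha_1 x_{t-1}^2 + \ldots + \alpha_q x_{t-q}^2 .$$

By assumption

$$\beta_j \geq 0 , \quad \text{and} \quad \beta(1) > 0 .$$

In Problem 6.1 we have shown that this is equivalent to

$$\beta(z) = 0 \;\Longrightarrow\; |z| > 1 .$$

This is in turn the condition of invertibility known from Proposition 3.3 which
guarantees a causal, absolutely summable series expansion with coefficients $\{c_j\}$:

$$\frac{1}{\beta(L)} = \sum_{j=0}^{\infty} c_j L^j , \quad \sum_{j=0}^{\infty} |c_j| < \infty .$$

By comparison of coefficients one obtains from

$$1 = \left(1 - \beta_1 L - \ldots - \beta_p L^p\right) \sum_{j=0}^{\infty} c_j L^j$$

as usual

$$\begin{aligned}
c_0 &= 1 \\
c_1 &= \beta_1 c_0 = \beta_1 \geq 0 \\
c_2 &= \beta_1 c_1 + \beta_2 c_0 = \beta_1^2 + \beta_2 \geq 0 \\
&\;\;\vdots \\
c_j &= \beta_1 c_{j-1} + \ldots + \beta_p c_{j-p} \geq 0 , \quad j \geq p .
\end{aligned}$$

Thus, the inversion of $\beta(L)$ yields:

$$\sigma_t^2 = \frac{\alpha_0}{\beta(1)} + \frac{\alpha_1 x_{t-1}^2 + \ldots + \alpha_q x_{t-q}^2}{\beta(L)} = \gamma_0 + \sum_{i=1}^{\infty} \gamma_i\, x_{t-i}^2 ,$$

where $\gamma_i$, $i > 0$, results by convolution of

$$\frac{\alpha_1 L + \ldots + \alpha_q L^q}{\beta(L)} = \sum_{k=1}^{q} \alpha_k L^k \, \sum_{j=0}^{\infty} c_j L^j .$$

The non-negativity and summability of $\{\alpha_k\}$ and $\{c_j\}$ is conveyed to the series $\{\gamma_i\}$.
Hence, the proof is complete.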
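The two steps of this proof (inverting the lag polynomial by the recursion for $c_j$, then convolving with the ARCH coefficients) translate directly into a short computation. The Python sketch below is only an illustration: the function names and the GARCH(1,1) parameter values are assumptions chosen for the example, and the infinite expansion is truncated after a fixed number of terms.

```python
def inverse_lag_coefficients(beta, n_terms):
    """Coefficients c_0, ..., c_{n_terms-1} of 1 / (1 - beta_1 L - ... - beta_p L^p)."""
    c = [1.0]  # c_0 = 1
    for j in range(1, n_terms):
        # c_j = beta_1 c_{j-1} + ... + beta_p c_{j-p}; lags before c_0 are skipped
        c.append(sum(b * c[j - k] for k, b in enumerate(beta, start=1) if j - k >= 0))
    return c


def arch_infinity_coefficients(alpha0, alpha, beta, n_terms):
    """Return gamma_0 and the first n_terms coefficients gamma_1, gamma_2, ..."""
    c = inverse_lag_coefficients(beta, n_terms)
    gamma0 = alpha0 / (1.0 - sum(beta))  # alpha_0 / beta(1)
    gammas = []
    for i in range(1, n_terms + 1):
        # gamma_i = sum over k of alpha_k * c_{i-k} (convolution of the two series)
        gammas.append(sum(a * c[i - k] for k, a in enumerate(alpha, start=1) if i - k >= 0))
    return gamma0, gammas


# Hypothetical GARCH(1,1) example: alpha_0 = 0.1, alpha_1 = 0.1, beta_1 = 0.8.
gamma0, gammas = arch_infinity_coefficients(0.1, [0.1], [0.8], n_terms=50)
print(gamma0)       # alpha_0 / (1 - beta_1)
print(gammas[:4])   # alpha_1 * beta_1**(i-1) for i = 1, 2, 3, 4
print(sum(gammas))  # partial sum, stays below alpha_1 / (1 - beta_1)
```

In the GARCH(1,1) case the coefficients decay geometrically, so non-negativity and summability are visible directly in the output.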

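6.6 As for the ARCH(1) case it holds that

$$\mu_4 = \mathrm{E}(x_t^4) = 3\, \mathrm{E}(\sigma_t^4) .$$

Applying $\mathrm{E}(x_t^2) = \mathrm{E}(\sigma_t^2) = \frac{\alpha_0}{1 - \alpha_1 - \beta_1}$ yields:

$$\begin{aligned}
\mathrm{E}(\sigma_t^4) &= \mathrm{E}\left(\left[\alpha_0 + \alpha_1 x_{t-1}^2 + \beta_1 \sigma_{t-1}^2\right]^2\right) \\
&= \mathrm{E}\left(\alpha_0^2 + \alpha_1^2 x_{t-1}^4 + \beta_1^2 \sigma_{t-1}^4\right) \\
&\quad + \mathrm{E}\left(2 \alpha_0 \alpha_1 x_{t-1}^2 + 2 \alpha_0 \beta_1 \sigma_{t-1}^2 + 2 \alpha_1 \beta_1 x_{t-1}^2 \sigma_{t-1}^2\right) \\
&= \alpha_0^2 + 3 \alpha_1^2\, \mathrm{E}(\sigma_{t-1}^4) + \beta_1^2\, \mathrm{E}(\sigma_{t-1}^4) \\
&\quad + \frac{2 \alpha_0^2 \alpha_1}{1 - \alpha_1 - \beta_1} + \frac{2 \alpha_0^2 \beta_1}{1 - \alpha_1 - \beta_1} \\
&\quad + 2 \alpha_1 \beta_1\, \mathrm{E}(\sigma_{t-1}^4)\, \mathrm{E}(\varepsilon_{t-1}^2) .
\end{aligned}$$

As for the ARCH(1) case one has to show that $\mathrm{E}(\sigma_t^4)$ turns out to be constant under
stationarity and the condition $1 - 2\alpha_1^2 - (\alpha_1 + \beta_1)^2 > 0$. We omit this step here and
take $\mathrm{E}(\sigma_t^4) = \mathrm{E}(\sigma_{t-1}^4)$ for granted. It then holds that: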

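$$\mathrm{E}(\sigma_t^4) = \frac{\alpha_0^2 + \dfrac{2 \alpha_0^2 (\alpha_1 + \beta_1)}{1 - \alpha_1 - \beta_1}}{1 - 2\alpha_1^2 - (\alpha_1 + \beta_1)^2} .$$

From this it follows that

$$\begin{aligned}
\gamma_2 = 3\, \mathrm{E}(\sigma_t^4)\, \frac{1}{(\mathrm{Var}(x_t))^2}
&= 3\, \frac{\alpha_0^2 (1 + \alpha_1 + \beta_1)}{(1 - \alpha_1 - \beta_1)\left(1 - 2\alpha_1^2 - (\alpha_1 + \beta_1)^2\right)} \, \frac{(1 - \alpha_1 - \beta_1)^2}{\alpha_0^2} \\
&= 3\, \frac{(1 + (\alpha_1 + \beta_1))(1 - (\alpha_1 + \beta_1))}{1 - 2\alpha_1^2 - (\alpha_1 + \beta_1)^2} \\
&= 3\, \frac{1 - (\alpha_1 + \beta_1)^2}{1 - 2\alpha_1^2 - (\alpha_1 + \beta_1)^2} .
\end{aligned}$$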


This is in accordance with the claim.


Stochastic Integrals


Wiener  Processes  (WP) 


7.1  Summary 

The  Wiener  process  (or  the  Brownian  motion)  is  the  starting  point  and  the  basis  for 
all  the  following  chapters.  This  is  why  we  will  consider  this  process  more  explicitly. 
It  is  a  continuous-time  process  having  as  prominent  a  position  in  stochastic  calculus 
as  the  Gaussian  distribution  in  statistics.  After  introducing  its  defining  properties 
intuitively,  we  will  discuss  important  characteristics  in  the  third  section.  Examples 
derived  from  the  Wiener  process  will  conclude  the  exposition. 


7.2  From  Random  Walk  to  Wiener  Process 

We  consider  a  nonstationary  special  case  of  the  AR(1)  process  and  thereby  try  to 
arrive  at  the  Wiener  process  which  is  the  most  important  continuous-time  process  in 
the  fields  of  our  applications. 


Random  Walks 

The cumulation of white noise is labeled random walk,

$$x_t = \sum_{j=1}^{t} \varepsilon_j , \quad t \in \{1, 2, \ldots, n\} .$$


1 Norbert Wiener, 1894-1964, was a US-American mathematician. He succeeded in finding a
mathematically solid definition and discussion of the so-called Brownian motion named after
the nineteenth century British botanist Brown. With a microscope, Brown initially observed and
described erratic paths of molecules.

Obviously, it holds that

$$x_t = x_{t-1} + \varepsilon_t , \quad x_0 = 0 .$$

In other words: The random walk results as an AR(1) process for the parameter
value $a = 1$ and with the starting value zero:

$$x_t = a\, x_{t-1} + \varepsilon_t , \quad a = 1 , \quad x_0 = 0 .$$

As the process is nonstationary,

$$\mathrm{E}(x_t) = 0 , \quad \mathrm{Var}(x_t) = \sigma^2 t ,$$

it cannot have an infinitely long past, i.e. the index set is finite, $T = \{1, 2, \ldots, n\}$.

In  a  way,  the  random  walk  models  the  way  home  of  a  drunk  who  at  a  point  in  time 
t  turns  to  the  left  or  to  the  right  by  chance  and  uncorrelated  with  his  previous  path. 
Put  more  formally:  The  random  walk  is  a  martingale.  We  briefly  want  to  convince 
ourselves  of  this  fact.  By  substitution  the  AR(1)  process  yields 


$$x_{t+s} = a^s x_t + \sum_{j=0}^{s-1} a^j \varepsilon_{t+s-j} .$$


Therefore, for $s > 0$ it holds that

$$\mathrm{E}(x_{t+s} \mid \mathcal{I}_t) = a^s x_t + 0 ,$$

where $\mathcal{I}_t$ again denotes the information set of the AR(1) process. Thus, the
martingale condition (2.11) for AR(1) processes is fulfilled if and only if $a = 1$.
The second martingale condition, $\mathrm{E}(|x_t|) < \infty$, is given as $\sigma^2 < \infty$ and hence
$\mathrm{E}(x_t^2) = t \sigma^2 < \infty$.

Example 7.1 (Discrete-Valued Random Walk) Let the set of outcomes contain only
two elements (e.g. coin toss: heads or tails),

$$\Omega = \{\omega_0 , \omega_1\} ,$$

with probabilities $\mathrm{P}(\{\omega_1\}) = \frac{1}{2} = \mathrm{P}(\{\omega_0\})$. Let $\{\varepsilon_t\}$ be a white noise process
assigning the numerical values $1$ and $-1$ to the events,

$$\varepsilon(t; \omega_1) = 1 , \quad \varepsilon(t; \omega_0) = -1 , \quad t = 1, 2, \ldots, n .$$

For every point in time, this induces the probabilities

$$\mathrm{P}_\varepsilon(\varepsilon_t = 1) = \mathrm{P}(\{\omega_1\}) = \mathrm{P}_\varepsilon(\varepsilon_t = -1) = \mathrm{P}(\{\omega_0\}) = \frac{1}{2} .$$

Then, for expectation and variance it is immediately obtained:

$$\mathrm{E}(\varepsilon_t) = 0 , \quad \mathrm{Var}(\varepsilon_t) = 1^2 \cdot \frac{1}{2} + (-1)^2 \cdot \frac{1}{2} = 1 .$$

For $t = 1, \ldots, n$, the corresponding random walk $x_t = \sum_{j=1}^{t} \varepsilon_j$ can only take on
the countably many values $\{-n, -n+1, \ldots, n-1, n\}$ and is therefore also called
discrete-valued. ■

2 This special assumption for the starting value is made out of convenience; it is by no means crucial
for the behavior of a random walk.

Example 7.2 (Continuous-Valued Random Walk) If $\{\varepsilon_t\}$ is a Gaussian random
process,

$$\varepsilon_t \sim \mathcal{N}(0, \sigma^2) ,$$

then, obviously, the random walk based thereon is also Gaussian, where the variance
grows linearly with time:

$$x_t = \sum_{j=1}^{t} \varepsilon_j \sim \mathcal{N}(0, \sigma^2 t) .$$

In this case, $x_t$ is a continuous random variable by assumption and hence this
random walk is also called continuous-valued. ■
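For the coin-toss walk of Example 7.1 the linear growth of the variance can be verified exactly rather than by simulation, since $x_t$ is a shifted binomial variable. The following Python sketch (the function names are illustrative assumptions) builds the exact distribution of $x_t$ with rational arithmetic and computes its mean and variance.

```python
from fractions import Fraction
from math import comb


def discrete_walk_pmf(t):
    """Exact distribution of x_t = eps_1 + ... + eps_t with P(eps = 1) = P(eps = -1) = 1/2."""
    # k up-steps and t - k down-steps lead to the value 2k - t
    return {2 * k - t: Fraction(comb(t, k), 2**t) for k in range(t + 1)}


def mean_and_variance(pmf):
    """Mean and variance of a finite distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


for t in (1, 2, 5, 10):
    pmf = discrete_walk_pmf(t)
    mean, var = mean_and_variance(pmf)
    print(t, sorted(pmf), mean, var)  # attainable values, mean 0, variance t
```

The attainable values at time $t$ all have the same parity as $t$, so they form a subset of $\{-n, \ldots, n\}$; the variance equals $t$, in line with $\mathrm{Var}(x_t) = \sigma^2 t$ for $\sigma^2 = 1$.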


Wiener  Process 


At this point, the continuous-time Wiener process will not yet be defined rigorously,
but we will approach it intuitively step by step. In order to do so, we choose the
index set $T = [0, 1]$ with the equidistant, disjoint partition

$$[0, 1) = \bigcup_{i=1}^{n} \left[\frac{i-1}{n}, \frac{i}{n}\right) .$$

Now, the random walk is multiplied by the factor $1/\sqrt{n}$ and expanded into a step
function $X_n(t)$. Interval by interval, we define as continuous-time process:

$$X_n(t) = \frac{1}{\sqrt{n}} \sum_{j=1}^{i-1} \varepsilon_j \quad \text{for } t \in \left[\frac{i-1}{n}, \frac{i}{n}\right) , \quad i = 1, \ldots, n . \qquad (7.1)$$

In addition, we assume $\varepsilon_t \in \{-1, 1\}$ and set for $t = 1$

$$X_n(1) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \varepsilon_j .$$


For $t = 0$, i.e. $i = 1$ in (7.1), we follow the convention that a sum equals zero if
the upper summation limit is smaller than the lower one, which is why $X_n(t)$
starts at the origin, $X_n(0) = 0$. Evidently, $X_n(t)$ is a step function that is constant on
each interval of length $1/n$; if $X_n(t)$ were observed only at the jump discontinuities,
a discrete-time random walk would be obtained. Since it depends on the choice of
$n$ (i.e. the fineness of the partition), the process $X_n(t)$ is indexed accordingly.

If $\varepsilon_t$ is from Example 7.1, i.e. $|\varepsilon_t| = 1$, this means that each individual step of
the step function has the height $1/\sqrt{n}$ in absolute value. Hence, $X_n(t)$ only takes on
values from

$$\left\{ \frac{-n}{\sqrt{n}} , \frac{-n+1}{\sqrt{n}} , \ldots , \frac{n-1}{\sqrt{n}} , \frac{n}{\sqrt{n}} \right\} .$$

Therefore, $X_n(t)$ is a continuous-time but discrete-valued process.
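The construction of the step function in (7.1) can be written out as a small program. The Python sketch below is a minimal illustration under the assumption of coin-toss increments as in Example 7.1; the helper name `step_function`, the seed, and the choice $n = 100$ are hypothetical.

```python
import math
import random


def step_function(eps):
    """Build X_n on [0, 1] from the increments eps_1, ..., eps_n as in (7.1)."""
    n = len(eps)
    partial = [0.0]
    for e in eps:
        partial.append(partial[-1] + e)  # partial[i] = eps_1 + ... + eps_i

    def x_n(t):
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        # For t in [(i-1)/n, i/n) the sum runs up to i - 1 = floor(n t);
        # at t = 1 all n increments are used. Values of t exactly on a grid
        # point are subject to floating-point rounding in n * t.
        idx = n if t == 1.0 else math.floor(n * t)
        return partial[idx] / math.sqrt(n)

    return x_n


rng = random.Random(0)
n = 100
eps = [rng.choice((-1, 1)) for _ in range(n)]
x_n = step_function(eps)

print(x_n(0.0))    # empty sum: the path starts at the origin
print(x_n(0.015))  # second interval: eps_1 / sqrt(n), i.e. +0.1 or -0.1
print(x_n(1.0))    # full standardized sum
```

Each jump has absolute height $1/\sqrt{n}$, so increasing `n` yields more but smaller steps.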

Now, the starting point for the Wiener process is the step function $X_n(t)$ with $\varepsilon_t$
from Example 7.1. The number of steps obviously depends on $n$, which indicates
the fineness of the partition of the unit interval. Simultaneously, the step height
$n^{-0.5}$ becomes smaller the finer the partition is. Note that, due to this
fact, the range becomes finer and finer and larger and larger as $n$ grows. Hence,
with $n$ growing, $X_n(t)$ becomes "more continuous", in the sense that the step heights
$n^{-0.5}$ turn out to be smaller; simultaneously, the jump discontinuities move together
more closely (the steps of the width $1/n$ get narrower) such that $X_n(t)$ can take on
more and more possible values. In the limit ($n \to \infty$) a process named after Norbert
Wiener is obtained which we will always denote by $W$ in the following:

$$X_n(t) \Rightarrow W(t) \quad \text{for } n \to \infty ,$$

where "$\Rightarrow$" denotes a mode of convergence which will be clarified in Chap. 14.
Intuitively speaking, it holds that for each of the uncountably many points in time $t$
the function $X_n(t)$ converges in distribution to $W(t)$ just at this point. The transition
of discrete-time and discrete-valued step functions from (7.1) to the Wiener process
(for $n$ growing) is illustrated in Fig. 7.1.

The Wiener process $W(t)$ as a limit of $X_n(t)$ is continuous-valued with range
$\mathbb{R} = (-\infty, \infty)$ and of course it is continuous-time with $t \in [0, 1]$. Furthermore,
the Wiener process is a Gaussian process (normally distributed), which is not that
surprising: due to the central limit theorem for $n \to \infty$, it holds for the
standardized sum of uncorrelated random variables $\{\varepsilon_j\}$ (whose variance is one and
whose expectation is zero) that it tends to a standard normal distribution:

$$X_n(1) = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \varepsilon_j \xrightarrow{d} \mathcal{N}(0, 1) . \qquad (7.2)$$

Here, "$\xrightarrow{d}$" denotes the usual convergence in distribution; cf. Sect. 8.4. As $X_n(1)$
tends to $W(1)$ at the same time, the Wiener process has to be a standard normally
distributed random variable at $t = 1$. After giving a formal definition, we will again
bridge the gap from the Wiener process to $X_n(t)$.

3 In order not to violate the concept of functions, strictly speaking, the vertical lines would not be
allowed to occur in the graphs of the figure. We ignore this subtlety to enhance clarity.

Fig. 7.1 Step function from (7.1) on the interval [0,1]




Formal  Definition 

The Wiener process (WP) $W(t)$, $t \in [0, T]$, is defined by three assumptions. Put into
words, they read: It is a process with starting value zero and independent, normally
distributed, stationary increments. These assumptions are now made precise.
Hence, the Wiener process is defined by:

(W1) The starting value is zero with probability one, $\mathrm{P}(W(0) = 0) = 1$;

(W2) non-overlapping increments $W(t_1) - W(t_0), \ldots, W(t_n) - W(t_{n-1})$, with
$0 \leq t_0 \leq t_1 \leq \ldots \leq t_n$, are independent for arbitrary $n$;

(W3) the increments follow a Gaussian distribution with the variance equalling the
difference of the arguments, $W(t) - W(s) \sim \mathcal{N}(0, t - s)$ with $0 \leq s < t$.

Note  that  the  variance  of  the  increments  does  not  depend  on  the  point  in  time  but 
only  on  the  temporal  difference.  Furthermore,  the  covariance  of  non-overlapping 
increments  is  zero  due  to  the  independence  and  the  joint  distribution  results  as 
the product of the marginal distributions. Hence, the joint distribution of
non-overlapping increments is multivariate normal. If all increments are measured over
equidistant constant time intervals, $t_i - t_{i-1} = \mathrm{const}$, then the variances are identical.
Therefore, such a series of increments is (strictly) stationary.

Although the WP is defined by its increments, they translate into properties of
the level. Obviously, the first and the third property (cf. footnote 4) imply

$$W(t) \sim \mathcal{N}(0, t) , \qquad (7.3)$$

i.e. the Wiener process is clearly a stochastic function being normally distributed at
every point in time with linearly growing variance $t$. More precisely, the WP is even
a Gaussian process in the sense of the definition from Chap. 2. The autocovariances
being necessary for the complete characterization of the multivariate normal
distribution of $(W(t_1), \ldots, W(t_n))'$ are determined as follows (see Problem 7.3):

$$\mathrm{Cov}(W(t), W(s)) = \min(s, t) . \qquad (7.4)$$
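The covariance formula (7.4) can be made plausible by simulation. The Python sketch below draws pairs $(W(s), W(t))$ with $s < t$ from independent Gaussian increments, as prescribed by (W1) to (W3), and averages the products. The function names, the seed, and the sample size are illustrative assumptions; the estimate is random and only approximates $\min(s, t)$.

```python
import math
import random


def simulate_pair(s, t, rng):
    """Draw (W(s), W(t)) for 0 < s < t using independent Gaussian increments."""
    w_s = rng.gauss(0.0, math.sqrt(s))            # W(s) - W(0) ~ N(0, s)
    w_t = w_s + rng.gauss(0.0, math.sqrt(t - s))  # W(t) - W(s) ~ N(0, t - s)
    return w_s, w_t


def estimate_covariance(s, t, n_draws=20000, seed=1):
    """Monte Carlo estimate of Cov(W(s), W(t)) = E[W(s) W(t)] (both means are zero)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        w_s, w_t = simulate_pair(s, t, rng)
        total += w_s * w_t
    return total / n_draws


cov_hat = estimate_covariance(0.3, 0.7)
print(cov_hat)  # should be close to min(0.3, 0.7) = 0.3
```

The estimate does not change systematically when `t` is moved further away from `s`, which reflects that only the smaller of the two time points enters (7.4).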

The Wiener process, which here will always be denoted by $W$, is for us a special
case of the more general Brownian motion. So to speak, it takes over the role of
the standard normal distribution, and by multiplication with a constant a general
Brownian motion is obtained as

$$B(t) = \sigma\, W(t) , \quad \sigma > 0 .$$


4 To be completely accurate, this needs to read: $W(t) - W(0) \sim \mathcal{N}(0, t)$. As $W(0)$ is zero with
probability one, we set $W(0)$ equal to zero here and in the following; then, the corresponding
statements only hold with probability one.

5 This convention does not hold beyond these pages. Many authors use the terms Wiener process
or Brownian motion interchangeably or they apply one of them exclusively.

The assumptions (W1) to (W3) seem very natural if the WP is accepted as a limit
of $X_n(t)$ from (7.1). For this process it holds by construction that


•  $X_n(t) = 0$ for $t \in [0, 1/n)$,

•  e.g. the increments

$$X_n(1) - X_n\left(\frac{n-1}{n}\right) = \frac{\varepsilon_n}{\sqrt{n}}$$

and

$$X_n\left(\frac{n-1}{n}\right) - X_n\left(\frac{n-2}{n}\right) = \frac{\varepsilon_{n-1}}{\sqrt{n}}$$

are uncorrelated (or even independent if $\{\varepsilon_t\}$ is a pure random process),

•  $X_n(1) - X_n(0)$ is approximately normally distributed due to (7.2).

The three properties (W1) to (W3) just reflect the properties of the step function
$X_n(t)$.

7.3  Properties 

We  have  already  come  to  know  some  properties  of  the  WP,  for  example  its 
autocovariance  structure  and  the  Gaussian  distribution.  For  the  understanding  and 
handling  of  Wiener  processes  further  properties  are  important. 

Pathwise  Properties 

Loosely speaking, it holds that the Brownian motion is everywhere (i.e. for all $t$)
continuous in terms of conventional calculus; however, it is nowhere differentiable.
These are pathwise properties, i.e. for a given $\omega_0$, $W(t) = W(t; \omega_0)$ can be regarded
as a function which is continuous in $t$ but which is nowhere differentiable. These
properties are mathematically rather deep, but they can be made
heuristically plausible at least. Concerning this, let us consider

$$W(t + h) - W(t) \sim \mathcal{N}(0, h) , \quad h > 0 .$$

For $h \to 0$ the given Gaussian distribution degenerates to zero suggesting
continuity: $W(t + h) - W(t) \to 0$ for $h \to 0$. Analogously, we observe a difference
quotient whose variance tends to infinity for $h \to 0$,

$$\frac{W(t + h) - W(t)}{h} \sim \mathcal{N}\left(0, \frac{1}{h}\right) ,$$

which suggests that a usual derivative does not exist. In substance, this
means that it is not possible to fit a tangent line to $W(t)$ which would allow for
approximating $W(t + h)$ for an ever so small $h$. Three simulated paths of the WP in
Fig. 7.2 illustrate these properties.

6 Occasionally, the pathwise continuity is claimed to be the fourth defining property. This is to be
understood as follows. Billingsley (1986, Theorem 37.1) proves more or less the following: If one
has a WP $W$ with (W1) to (W3) at hand, then a process $W^*$ can be constructed which is a WP in
the sense of (W1) to (W3), as well, which has the same distribution as $W$ and which is pathwise
continuous. As $W^*$ and $W$ are equal in distribution, they cannot be distinguished and therefore,
w.l.o.g. it can safely be assumed that one may work with the continuous $W^*$.

Fig. 7.2 Simulated paths of the WP on the interval [0,1]


Markov  and  Martingale  Property 

In the previous section, we have learned that the random walk is a martingale. For
the WP as a continuous-time counterpart, a corresponding result can be obtained
(where $\mathcal{I}_t = \sigma(W(r), r \leq t)$ contains all the information about the past up to $t$):

$$\mathrm{E}(|W(t)|) < \infty ,$$

$$\mathrm{E}(W(t + s) \mid \mathcal{I}_t) = W(t) .$$


The WP satisfies the Markov property (2.9) as well. In order to show this, we use the
fact that the increment $W(t + s) - W(t)$ for $s > 0$ is independent of the information
set $\mathcal{I}_t$ due to (W2). Hence, for $W(t) = v$ it holds that:

$$\begin{aligned}
\mathrm{P}(W(t + s) \leq w \mid \mathcal{I}_t) &= \mathrm{P}(W(t + s) - W(t) \leq w - v \mid \mathcal{I}_t) \\
&= \mathrm{P}(W(t + s) - W(t) \leq w - v) .
\end{aligned}$$

At the same time it holds again due to independence that:

$$\begin{aligned}
\mathrm{P}(W(t + s) \leq w \mid W(t) = v) &= \mathrm{P}(W(t + s) - W(t) \leq w - v \mid W(t) = v) \\
&= \mathrm{P}(W(t + s) - W(t) \leq w - v) ,
\end{aligned}$$

which just verifies the Markov property.


Scale  Invariance 

The  Wiener  process  is  a  function  being  Gaussian  for  every  point  in  time  t  with 
expectation  zero  and  variance  t.  However,  time  can  be  measured  in  minutes,  hours 
or other units. If the time scale is blown up or squeezed by the factor $\sigma > 0$, then it
holds that

$$W(\sigma t) \sim \mathcal{N}(0, \sigma t) .$$

The same distribution is obtained for the $\sqrt{\sigma}$-fold of the Wiener process:

$$\sqrt{\sigma}\, W(t) \sim \mathcal{N}(0, \sigma t) .$$

That is why the Wiener process is called scale-invariant (or self-similar). Hence,
$W(\sigma t)$ and $\sqrt{\sigma}\, W(t)$ are equal in distribution, which we formulate as

$$\sqrt{\sigma}\, W(t) \sim W(\sigma t) \qquad (7.5)$$

as well. Such an equality in distribution is to be handled with care and by no means
to be confused with ordinary equality. Naturally, it does not hold that, e.g. the double
of $W(t)$ is equal to the value at the point in time $4t$:

$$\sqrt{\sigma}\, W(t) \neq W(\sigma t) .$$

In other words: Scale invariance is a distributional and not a pathwise property.

Up  to  this  point,  it  has  not  been  emphasized  that  the  Wiener  process  is 
nonstationary. This can already be noted in (7.4), as for $s > 0$ the covariance
$\mathrm{Cov}(W(t), W(t + s)) = t$ does not depend on the temporal distance
$s$, but on the point in time $t$ itself. As we have said, the increments
of  the  WP  from  (W2),  however,  are  multivariate  normal  with  expectations  and 
covariances  of  zero  and  variances  which  are  not  affected  by  a  shift  on  the  time  axis. 
The  trending  behavior  of  the  nonstationary  Wiener  process  will  now  be  clarified  by 
two  propositions. 


Hitting  Time 

Let $T_b$ be the point in time at which the WP attains (or hits) a given value $b > 0$ for
the first time. By variable transformation it is shown that this random variable has
the distribution function (see Eq. (7.14) in Problem 7.5)

$$F_b(t) := \mathrm{P}(T_b \leq t) = \frac{2}{\sqrt{2\pi}} \int_{b/\sqrt{t}}^{\infty} e^{-y^2/2}\, dy .$$

Thereby statement (a) from the following proposition is proved; statement (b) is
obtained by means of the corresponding density function (see Problem 7.5).

Proposition 7.1 (Hitting Time) For the hitting time $T_b$, where the WP hits $b > 0$
for the first time, it holds that:

(a) $\mathrm{P}(T_b > t) \to 0$ for $t \to \infty$;

(b) $\mathrm{E}(T_b)$ does not exist.

The event $T_b > t$ is tantamount to the fact that $W(s)$ has not attained the value $b$
up to $t$:

$$\mathrm{P}(T_b > t) = \mathrm{P}\left(\max_{0 \leq s \leq t} W(s) < b\right) .$$

Loosely formulated, this proposition implies that, paradoxically, (a) sooner or later, the
WP exceeds every value with certainty; (b) on average, this takes infinitely long:
$\mathrm{E}(T_b) = \infty$.
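Both statements of Proposition 7.1 can be illustrated numerically. Since the integral in $F_b(t)$ is twice an upper tail probability of the standard normal distribution, it can be written with the complementary error function as $F_b(t) = \operatorname{erfc}\big(b/\sqrt{2t}\big)$. The Python sketch below uses this identity; the function names are illustrative assumptions. For large $t$ the survival probability behaves like $b\sqrt{2/(\pi t)}$, which tends to zero (statement (a)) but too slowly for $\int_0^\infty \mathrm{P}(T_b > t)\, dt$ to be finite (statement (b)).

```python
import math


def hitting_time_cdf(t, b):
    """F_b(t) = P(T_b <= t) = 2 (1 - Phi(b / sqrt(t))) = erfc(b / sqrt(2 t)), b > 0, t > 0."""
    return math.erfc(b / math.sqrt(2.0 * t))


def hitting_time_survival(t, b):
    """P(T_b > t) = 1 - F_b(t) = erf(b / sqrt(2 t))."""
    return math.erf(b / math.sqrt(2.0 * t))


b = 1.0
for t in (1.0, 10.0, 100.0, 1e4, 1e6):
    approx = b * math.sqrt(2.0 / (math.pi * t))  # leading term of the tail for large t
    print(t, hitting_time_cdf(t, b), hitting_time_survival(t, b), approx)
```

The last two columns agree more and more closely as `t` grows, showing the $t^{-1/2}$ decay of the tail.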


7 The random variable $T_b$ is a so-called "stopping time". This is a term from the theory of stochastic
processes which we will not elaborate on here.

Zero Crossing

Next, let $p(t_1, t_2)$ with $0 < t_1 < t_2$ be the probability of the WP hitting the zero
line at least once between $t_1$ and $t_2$ (even if not necessarily crossing it). We then talk
about a zero crossing. The following proposition states how to calculate it. For a
proof see e.g. Klebaner (2005, Theorem 3.25).

Proposition 7.2 (Arcus Law)

The probability of a zero crossing between $t_1$ and $t_2$, $0 < t_1 < t_2$, equals

$$p(t_1, t_2) = \frac{2}{\pi} \arctan \sqrt{\frac{t_2 - t_1}{t_1}} ,$$

where arctan denotes the inverse of the tangent function $\tan = \sin / \cos$.

It is interesting to examine the limiting cases of Proposition 7.2. From the
shape of the inverse function of the tangent function it results that

$$\lim_{x \to \infty} \arctan x = \frac{\pi}{2} \quad \text{and} \quad \lim_{x \to 0} \arctan x = 0 .$$

Hence, for $t_2 \to \infty$ it follows that the probability of attaining the zero
line tends to one; for $t_2 \to t_1$, however, it naturally converges to zero.

In the literature, an equivalent formulation of the Arcus Law is found:

$$p(t_1, t_2) = \frac{2}{\pi} \arccos \sqrt{\frac{t_1}{t_2}} .$$

The equivalence is based on the formula

$$\arctan x = \arccos \frac{1}{\sqrt{1 + x^2}} ,$$

see e.g. Gradshteyn and Ryzhik (2000, 1.624-8), where "arccos" stands for the
inverse of the cosine function.
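The equivalence of the arctan and the arccos formulation, as well as the two limiting cases, can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative assumptions.

```python
import math


def zero_crossing_prob_arctan(t1, t2):
    """p(t1, t2) = (2 / pi) * arctan(sqrt((t2 - t1) / t1)) for 0 < t1 < t2."""
    return 2.0 / math.pi * math.atan(math.sqrt((t2 - t1) / t1))


def zero_crossing_prob_arccos(t1, t2):
    """Equivalent form p(t1, t2) = (2 / pi) * arccos(sqrt(t1 / t2))."""
    return 2.0 / math.pi * math.acos(math.sqrt(t1 / t2))


for t1, t2 in [(1.0, 1.001), (1.0, 2.0), (0.5, 3.0), (1.0, 1e6)]:
    print(t1, t2, zero_crossing_prob_arctan(t1, t2), zero_crossing_prob_arccos(t1, t2))
```

With $x = \sqrt{(t_2 - t_1)/t_1}$ one has $1 + x^2 = t_2/t_1$, which is why the two columns coincide up to rounding. For $t_2 = 2 t_1$ the probability of a zero crossing is exactly one half.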


7.4  Functions  of  Wiener  Processes 

When  applying  stochastic  calculus,  one  is  often  concerned  with  processes  derived 
from  the  Brownian  motion.  In  this  section,  some  of  these  will  be  covered  and 
illustrated  graphically.  We  simulate  processes  on  the  interval  [0, 1];  for  this  purpose, 
the  theoretically  continuous  processes  are  calculated  at  1000  sampling  points 
and plotted. The resulting graphs are based on pseudo-random variables. Details
regarding the simulation of stochastic processes are treated in Chap. 12 on stochastic
differential equations.

Fig. 7.3 WP and Brownian motion with $\sigma = 0.5$


Brownian Motion $B(t)$

In Fig. 7.3 a path of a WP and a Brownian motion based thereon with only half the
standard deviation,

$$W(t) \quad \text{and} \quad B(t) = 0.5\, W(t) ,$$

are depicted. Obviously, the one graph is just half of the other.


Brownian Motion with Drift $X(t) = \mu t + \sigma W(t)$

Here it holds that both the expectation and the variance grow linearly with $t$:

$$X(t) \sim \mathcal{N}(\mu t, \sigma^2 t) .$$

Fig. 7.4 WP and Brownian motion with drift, where $\sigma = 1$

In Fig. 7.4 the WP from Fig. 7.3 and the process based thereon with drift are
depicted. The drift parameter is $\mu = 2$ for $\sigma = 1$, and the expectation function
$2t$ is also displayed.


Brownian Bridge $X(t) = B(t) - t\, B(1)$

This process is based on the Brownian motion, $B(t) = \sigma W(t)$, and fundamentally,
it is only defined for $t \in [0, 1]$. The name comes from the fact that the starting and
the final value are equal with probability one by construction: $X(0) = X(1) = 0$.
One can verify easily that (see Problem 7.6):

$$\mathrm{Var}(X(t)) = t\, (1 - t)\, \sigma^2 < t\, \sigma^2 . \qquad (7.6)$$

Hence, for $t \in (0, 1]$ it holds that $\mathrm{Var}(X(t)) < \mathrm{Var}(B(t))$. This is intuitively clear:
Being forced back to zero, the Brownian bridge has to exhibit less variability
than the Brownian motion. This is also shown in Fig. 7.5 for $\sigma = 1$.

Fig. 7.5 WP and Brownian bridge ($\sigma = 1$)

Reflected Wiener Process $X(t) = |W(t)|$


Due to $W(t) \sim \mathcal{N}(0, t)$, for the distribution function it is elementary to obtain (see
Problem 7.7):

$$\mathrm{P}(X(t) \leq x) = \frac{2}{\sqrt{2\pi t}} \int_{-\infty}^{x} e^{-\frac{y^2}{2t}}\, dy - 1 .$$

Note that one integrates over twice the density of a Gaussian random variable with
expectation zero. Therefore it immediately holds that

$$\mathrm{P}(X(t) \leq x) = \frac{2}{\sqrt{2\pi t}} \int_{0}^{x} e^{-\frac{y^2}{2t}}\, dy . \qquad (7.7)$$

Expectation and variance of the reflected Wiener process can be determined from
the corresponding density function. They read (see Problem 7.7):

$$\mathrm{E}(X(t)) = \sqrt{\frac{2t}{\pi}} , \quad \mathrm{Var}(X(t)) = t \left(1 - \frac{2}{\pi}\right) < t = \mathrm{Var}(W(t)) . \qquad (7.8)$$

Fig. 7.6 WP and reflected WP along with expectation

As the reflected Wiener process cannot become negative, it has a positive expected
value growing with $t$. For the same reason its variance is smaller than the one of the
unrestricted Wiener process, see Fig. 7.6.
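The moments of the reflected Wiener process can be compared with simulated values. The Python sketch below draws $W(t) \sim \mathcal{N}(0, t)$ directly at a single point in time and takes absolute values; the function name, the seed, the value $t = 2$, and the sample size are illustrative assumptions, and the sample moments only approximate the theoretical ones.

```python
import math
import random


def reflected_wp_moments(t):
    """Theoretical mean sqrt(2 t / pi) and variance t (1 - 2 / pi) of |W(t)|."""
    return math.sqrt(2.0 * t / math.pi), t * (1.0 - 2.0 / math.pi)


rng = random.Random(3)
t, n_draws = 2.0, 50000
draws = [abs(rng.gauss(0.0, math.sqrt(t))) for _ in range(n_draws)]

mean_hat = sum(draws) / n_draws
var_hat = sum((x - mean_hat) ** 2 for x in draws) / n_draws
mean_theory, var_theory = reflected_wp_moments(t)

print(mean_hat, mean_theory)  # sample mean vs. sqrt(2 t / pi)
print(var_hat, var_theory)    # sample variance vs. t (1 - 2 / pi), below Var(W(t)) = t
```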


Geometric Brownian Motion $X(t) = e^{\mu t + \sigma W(t)}$

By definition, it holds in this case that the logarithm of the process is a Brownian
motion with drift and therefore Gaussian,

$$\log X(t) = \mu t + \sigma W(t) \sim \mathcal{N}(\mu t, \sigma^2 t) .$$

A random variable $Y$ whose logarithm is Gaussian is called - as would seem natural
- log-normal (logarithmically normally distributed). If it holds that

$$\log Y \sim \mathcal{N}(\mu_y, \sigma_y^2) ,$$

then we know what the first two moments of $Y$ look like, cf. e.g. Sydsæter, Strøm,
and Berck (1999, p. 189) or Johnson, Kotz, and Balakrishnan (1994, Ch. 14):

$$\mathrm{E}(Y) = e^{\mu_y + \sigma_y^2/2} , \quad \mathrm{Var}(Y) = e^{2\mu_y + \sigma_y^2} \left(e^{\sigma_y^2} - 1\right) .$$

Hence, by plugging in we obtain for the geometric Brownian motion

$$\mathrm{E}(X(t)) = e^{(\mu + \sigma^2/2)\, t} \quad \text{and} \quad \mathrm{Var}(X(t)) = e^{(2\mu + \sigma^2)\, t} \left(e^{\sigma^2 t} - 1\right) . \qquad (7.9)$$

While $\log X(t)$ is Gaussian with a linear trend, $\mu t$, as expectation, $X(t)$ exhibits
an exponentially growing expectation function. Particularly for $\mu = 0$ and $\sigma = 1$
the results

$$\mathrm{E}(X(t)) = e^{t/2} \quad \text{and} \quad \mathrm{Var}(X(t)) = e^{t} \left(e^{t} - 1\right) \qquad (7.10)$$

are obtained. The on average exponential growth in the case of $\mu > -\sigma^2/2$ is
illustrated in Fig. 7.7. In Fig. 7.8 we find graphs of the WP and a geometric Brownian
motion with expectation one, namely with $\mu = -0.5$ and $\sigma = 1$. Generally, for
$\mu = -\sigma^2/2$ an expectation function of one is obtained. Then, one also says that the
process does not exhibit a trend (or drift).

8 By "log" we denote the natural logarithm and not the common logarithm.
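The moment formulas (7.9) and their special cases are easy to evaluate. The following Python sketch (function names are illustrative assumptions) implements the expectation and variance of a geometric Brownian motion and reproduces (7.10) as well as the driftless case $\mu = -\sigma^2/2$.

```python
import math


def gbm_mean(t, mu, sigma):
    """E(X(t)) = exp((mu + sigma^2 / 2) t) for X(t) = exp(mu t + sigma W(t))."""
    return math.exp((mu + 0.5 * sigma**2) * t)


def gbm_variance(t, mu, sigma):
    """Var(X(t)) = exp((2 mu + sigma^2) t) * (exp(sigma^2 t) - 1)."""
    return math.exp((2.0 * mu + sigma**2) * t) * (math.exp(sigma**2 * t) - 1.0)


for t in (0.25, 0.5, 1.0):
    # mu = 0, sigma = 1 gives exp(t / 2) and exp(t) (exp(t) - 1), cf. (7.10)
    print(t, gbm_mean(t, 0.0, 1.0), gbm_variance(t, 0.0, 1.0))
    # mu = -sigma^2 / 2 gives a constant expectation of one
    print(t, gbm_mean(t, -0.5, 1.0))
```

Even in the driftless case the variance keeps growing with $t$, since `gbm_variance(t, -0.5, 1.0)` equals $e^{t} - 1$.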


Fig. 7.7 Geometric Brownian motion with $\mu = 1.5$ and $\sigma = 1$ along with expectation

Fig. 7.8 WP and geometric Brownian motion with $\mu = -0.5$ and $\sigma = 1$


In comparison to the expectation, the median of a geometric Brownian motion
does not depend on $\sigma$. Rather, it holds that (see Problem 7.8):

$$\mathrm{P}\left(e^{\mu t + \sigma W(t)} \leq e^{\mu t}\right) = 0.5 ,$$

such that the median results as $e^{\mu t}$.

Maximum of a WP $X(t) = \max_{0 \leq s \leq t} W(s)$

At $t$, the maximum process is assigned the maximal value which the WP has taken
on up to this point in time. Therefore, in periods of a decreasing Wiener process
path, $X(t)$ is constant on the historic maximum until a new relative maximum is
attained. However, this process has a distribution function that we have already come
to know. By applying the distribution function of the stopping time which is given
above Proposition 7.1, one shows (see Problem 7.9) that the maximum process and
the reflected WP are equal in distribution:

$$\mathrm{P}(X(t) \leq b) = \mathrm{P}(|W(t)| \leq b) .$$

Therefore, expectation and variance of the maximum process can naturally
be copied from those of $|W(t)|$:

$$\mathrm{E}(X(t)) = \sqrt{\frac{2t}{\pi}} , \quad \mathrm{Var}(X(t)) = t \left(1 - \frac{2}{\pi}\right) < t = \mathrm{Var}(W(t)) . \qquad (7.11)$$

The expected value is positive and grows with time as the WP again and again replaces a
relative positive maximum by a new relative maximum. As the process is
constant over and over again for stretches of time, it is not surprising that its variance is smaller
than the one of the underlying WP, cf. Fig. 7.9.

Fig. 7.9 WP and maximum process along with expectation


Integrated Wiener Process $X(t) = \int_0^t W(s)\, ds$

As the Brownian motion is a continuous function, the Riemann integral can be
defined pathwise. Hence, e.g. the following random variable is obtained:

$$\int_0^1 W(t)\, dt .$$

Behind this random variable hides a normal distribution. The latter can be proved
by using the definition of the Riemann integral or as a simple conclusion of
Proposition 8.3 below:

$$\int_0^1 W(t)\, dt \sim \mathcal{N}(0, 1/3) . \qquad (7.12)$$

Basically, by using the integral of a WP, a new stochastic process can also be
generated by making the upper limit of integration time-dependent:

$$X(t) = \int_0^t W(s)\, ds .$$

This  idea  forms  the  starting  point  for  the  subsequent  chapter.  In  Fig.  7.10  the  relation 
is  shown  between  the  WP  and  the  integral  X(t)  as  the  area  beneath  the  graph. 


Fig. 7.10 WP and integrated WP

7.5 Problems and Solutions


Problems 


7.1 Consider with $\sigma > 0$

$$X(t) = W(1) - \sigma W(1 - t) \quad \text{for } 0 \le t \le 1.$$

Determine the mean and variance of $X(t)$.

7.2 Consider

$$X(t) = t\,W(t^{-1}) \quad \text{for } t > 0.$$

Determine the covariance of $X(t)$ and $W(t)$, $\mathrm{Cov}(X(t), W(t))$.

7.3 Derive the autocovariance function of a WP, (7.4). Find a simple expression in
$t$ and $s$ only for the autocorrelations

$$\rho(s, t) = \frac{\mathrm{Cov}(W(t), W(s))}{\sqrt{\mathrm{Var}(W(t))\,\mathrm{Var}(W(s))}}.$$

7.4 Choose $d \in \mathbb{R}$ such that $T^{d-0.5}\,W(t)$ and $W(Tt)$ are equal in distribution.

7.5 Prove Proposition 7.1 using the hints given in the text.

7.6 Derive the autocovariance function of a Brownian bridge, and hence show (7.6).

7.7 Determine the distribution function, (7.7), and the moments, (7.8), of a reflected
Wiener process.

7.8 Show that in the general case of a geometric Brownian motion, $e^{\mu t + \sigma W(t)}$, the
median is given by $e^{\mu t}$.

7.9 Show by means of the hints in the text that the maximum process of a WP and
the corresponding reflected WP are equal in distribution:

$$\mathrm{P}\left(\max_{0 \le s \le t} W(s) \le b\right) = \mathrm{P}(|W(t)| \le b).$$

Solutions


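7.1 As the Wiener process is on average zero at every point in time, this obviously
holds for $X(t)$ as well. Therefore, the variance is calculated as follows:

$$\begin{aligned}
\mathrm{Var}(X(t)) &= \mathrm{E}(X^2(t)) \\
&= \mathrm{E}\left[W^2(1) - 2\sigma W(1) W(1-t) + \sigma^2 W^2(1-t)\right] \\
&= \mathrm{Var}(W(1)) - 2\sigma\,\mathrm{Cov}(W(1), W(1-t)) + \sigma^2\,\mathrm{Var}(W(1-t)) \\
&= 1 - 2\sigma \min(1, 1-t) + \sigma^2 (1-t) \\
&= 1 - 2\sigma (1-t) + \sigma^2 (1-t) \\
&= t + (1-t)(1-\sigma)^2.
\end{aligned}$$

7.2 Due to $\mathrm{E}(W(t)) = \mathrm{E}(X(t)) = 0$ one obtains:

$$\mathrm{Cov}(X(t), W(t)) = \mathrm{E}\left(t\,W(t^{-1})\,W(t)\right) = t \min(t^{-1}, t).$$

Because of

$$\min(t^{-1}, t) = \begin{cases} t, & 0 < t \le 1 \\ t^{-1}, & t > 1 \end{cases}$$

it follows that

$$\mathrm{Cov}(X(t), W(t)) = \begin{cases} t^2, & 0 < t \le 1 \\ 1, & t > 1. \end{cases}$$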

7.3 We simply apply the defining properties (W1), (W2) and (W3) or, put
differently, (7.3). Due to (7.3) the WP has an expectation of zero such that

$$\mathrm{Cov}(W(t), W(s)) = \mathrm{E}(W(t) W(s)).$$

W.l.o.g. let $s \le t$. By using (W1) and (W2) and after adding zero, we then write:

$$\begin{aligned}
\mathrm{E}(W(t) W(s)) &= \mathrm{E}\left([W(s) + W(t) - W(s)]\,[W(s) - W(0)]\right) \\
&= \mathrm{E}\left([W(s)]^2\right) + \mathrm{E}\left([W(t) - W(s)]\,[W(s) - W(0)]\right) \\
&= s + 0,
\end{aligned}$$

where the last equality uses $\mathrm{Var}(W(s)) = s$ and the independence of non-overlapping
increments. As one could also assume $t \le s$ w.l.o.g., (7.4) is verified.
With the autocovariance one obtains

$$\rho(s, t) = \frac{\min(s, t)}{\sqrt{ts}} = \frac{\min(s, t)}{\sqrt{\max(s, t)\,\min(s, t)}} = \sqrt{\frac{\min(s, t)}{\max(s, t)}}.$$

7.4 This is a problem on self-similarity or scale invariance. Due to (7.3) it obviously
holds that:

$$T^{d-0.5}\,W(t) \sim \mathcal{N}(0, T^{2d-1} t)$$

and

$$W(Tt) \sim \mathcal{N}(0, Tt).$$

Therefore, the corresponding variances are equal for $d = 1$. The corresponding
result is obtained from (7.5) as well:

$$T^{0.5}\,W(t) \sim W(Tt).$$

7.5 Proof of Proposition 7.1(a): Our proof consists of three steps. At first, we
establish the equation

$$\mathrm{P}(T_b < t) = 2\,\mathrm{P}(W(t) > b). \tag{7.13}$$

Secondly, by using this we show:

$$F_b(t) := \mathrm{P}(T_b \le t) = \frac{2}{\sqrt{2\pi}} \int_{b/\sqrt{t}}^{\infty} e^{-y^2/2}\,dy. \tag{7.14}$$

Note that in (7.14) the integrand, together with the factor $1/\sqrt{2\pi}$, amounts to the
density function of the standard normal distribution. Hence, thirdly for $t \to \infty$ the
claim immediately follows from (7.14) due to $\mathrm{P}(T_b > t) = 1 - \mathrm{P}(T_b \le t)$.

In order to accept (7.13), we remember (7.3). Accordingly, for $T_b < t$ it holds
that

$$W(t) - W(T_b) \sim \mathcal{N}(0, t - T_b),$$

which is why from the symmetry of the Gaussian distribution with $W(T_b) = b$ it
follows for the conditional probability that:

$$\mathrm{P}(W(t) > b \mid T_b < t) = \mathrm{P}(W(t) - W(T_b) > 0 \mid T_b < t) = \frac{1}{2}.$$

Hence, it results that:

$$\frac{1}{2} = \frac{\mathrm{P}(W(t) > b \text{ and } T_b < t)}{\mathrm{P}(T_b < t)} = \frac{\mathrm{P}(W(t) > b)}{\mathrm{P}(T_b < t)},$$

where the last equality is caused by the fact that $T_b < t$ is implied by $W(t) > b$.
Thus, we obtain Eq. (7.13) which will now be applied for deriving (7.14).

Due to (7.3) it holds by definition that

$$\mathrm{P}(W(t) > b) = \int_b^{\infty} \frac{1}{\sqrt{2\pi t}}\,e^{-x^2/(2t)}\,dx.$$

By variable transformation, $y = x/\sqrt{t}$, it follows that

$$\mathrm{P}(W(t) > b) = \int_{b/\sqrt{t}}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,dy,$$

whereby (7.14) and hence claim (a) is proved due to $\mathrm{P}(T_b < t) = \mathrm{P}(T_b \le t)$.
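The distribution function of the hitting time is easy to evaluate numerically. The sketch below rewrites $2\,\mathrm{P}(W(t) > b)$ in terms of the complementary error function, using the identity $2(1 - \Phi(z)) = \mathrm{erfc}(z/\sqrt{2})$; the function name `hitting_time_cdf` and the chosen horizons are illustrative assumptions.

```python
import math


def hitting_time_cdf(t, b):
    """P(T_b <= t) = 2 P(W(t) > b) for b > 0 and t > 0."""
    # 2 * (1 - Phi(b / sqrt(t))) equals erfc(b / sqrt(2 t)).
    return math.erfc(b / math.sqrt(2.0 * t))


# The probability of having hit b = 1 approaches one as the horizon grows.
for horizon in (1.0, 100.0, 1e4, 1e8):
    print(horizon, hitting_time_cdf(horizon, b=1.0))
```

The printed values increase towards one, which illustrates numerically that the level $b$ is eventually hit with probability one.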

Proof of Proposition 7.1(b): With the density function $f_b(t) = F_b'(t)$ the approach
for the expected value reads as follows:

$$\mathrm{E}(T_b) = \int_0^{\infty} t\,f_b(t)\,dt.$$

Note that the distribution function derived in (7.14) has the following structure with
the antiderivative $H$, $H' = h$:

$$F_b(t) = \int_{g(t)}^{\infty} h(y)\,dy = \lim_{c \to \infty} H(c) - H(g(t)).$$

Therefore, due to the chain rule it holds for the density, with $g(t) = b\,t^{-1/2}$, that

$$F_b'(t) = -h(g(t))\,g'(t) = \frac{b\,e^{-b^2/(2t)}\,t^{-3/2}}{\sqrt{2\pi}}.$$

Hence, the variable transformation results in $t = b^2 u^{-2}$ with $dt = -2 b^2 u^{-3}\,du$:

$$\begin{aligned}
\mathrm{E}(T_b) &= \int_0^{\infty} t\,F_b'(t)\,dt \\
&= \frac{b}{\sqrt{2\pi}} \int_0^{\infty} e^{-b^2/(2t)}\,t^{-1/2}\,dt
 = \frac{-2b^2}{\sqrt{2\pi}} \int_{\infty}^{0} e^{-u^2/2}\,u^{-2}\,du \\
&= \frac{2b^2}{\sqrt{2\pi}} \int_0^{\infty} e^{-u^2/2}\,u^{-2}\,du \\
&\ge \frac{2b^2}{\sqrt{2\pi}} \int_0^{1} e^{-u^2/2}\,u^{-2}\,du \\
&\ge \frac{2b^2}{\sqrt{2\pi}}\,e^{-1/2} \int_0^{1} u^{-2}\,du.
\end{aligned}$$

However, this last integral written down symbolically does not exist since the
antiderivative of $u^{-2}$ is $-u^{-1}$, and

$$\int_{\varepsilon}^{1} u^{-2}\,du = \varepsilon^{-1} - 1$$

diverges as $\varepsilon \to 0$. This completes the proof.
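The divergence of the expected hitting time can also be seen numerically by truncating the integral $\int_0^{\infty} t\,f_b(t)\,dt$ at a finite upper limit. The following sketch is an illustration only: it uses the substitution $t = v^2$ (an assumption made here for numerical convenience, not the substitution from the proof) and a simple midpoint rule; the name `truncated_mean_hitting_time` is hypothetical.

```python
import math


def truncated_mean_hitting_time(upper, b=1.0, n_grid=20_000):
    """Midpoint-rule value of the integral of t * f_b(t) over [0, upper]."""
    # With t = v**2 the expression t * f_b(t) dt becomes
    # (2 b / sqrt(2 pi)) * exp(-b**2 / (2 v**2)) dv, bounded on (0, sqrt(upper)].
    v_max = math.sqrt(upper)
    dv = v_max / n_grid
    total = 0.0
    for k in range(n_grid):
        v = (k + 0.5) * dv
        total += math.exp(-b * b / (2.0 * v * v))
    return 2.0 * b / math.sqrt(2.0 * math.pi) * total * dv


# The truncated expectation keeps growing, roughly like the square root of the limit.
for upper in (1e2, 1e4, 1e6):
    print(upper, truncated_mean_hitting_time(upper))
```

Each hundredfold increase of the upper limit multiplies the truncated value by roughly ten, so no finite limit is approached.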

7.6 First, we determine that

$$X(t) = W(t) - t\,W(1), \quad t \in [0, 1],$$

has zero expectation:

$$\mathrm{E}(X(t)) = \mathrm{E}(W(t)) - t\,\mathrm{E}(W(1)) = 0 - 0.$$

By multiplying out and application of (7.4) one can show that

$$\begin{aligned}
\mathrm{Cov}(X(t), X(s)) &= \mathrm{E}(X(t) X(s)) \\
&= \mathrm{E}\left(W(t) W(s) - t\,W(1) W(s) - s\,W(1) W(t) + st\,W^2(1)\right) \\
&= \min(s, t) - t \min(s, 1) - s \min(t, 1) + st \\
&= \min(s, t) - st - st + st \\
&= \min(s, t) - st.
\end{aligned}$$

In particular, for $s = t$ the variance formula (7.6) is obtained.

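7.7 At first we determine the distribution function (7.7) for $X(t) = |W(t)|$:

$$\begin{aligned}
F_x(x) &= \mathrm{P}(X(t) \le x), \quad x \ge 0, \\
&= \mathrm{P}(W(t) \le x) - \mathrm{P}(W(t) \le -x) \\
&= \mathrm{P}(W(t) \le x) - (1 - \mathrm{P}(W(t) \le x)) \\
&= 2\,\mathrm{P}(W(t) \le x) - 1,
\end{aligned}$$

where the symmetry of the Gaussian distribution was used. With $W(t) \sim \mathcal{N}(0, t)$
we therefore have

$$F_x(x) = \frac{2}{\sqrt{2\pi t}} \int_{-\infty}^{x} e^{-y^2/(2t)}\,dy - 1$$

or for the density

$$f_x(x) = F_x'(x) = \frac{2}{\sqrt{2\pi t}}\,e^{-x^2/(2t)}.$$

In order to calculate expectation and variance, we determine the $r$-th moment in
general:

$$\mathrm{E}(X^r(t)) = \int_0^{\infty} x^r f_x(x)\,dx = \frac{2}{\sqrt{2\pi t}} \int_0^{\infty} x^r e^{-x^2/(2t)}\,dx.$$

By substitution, these moments can be reduced to the Gamma function, which was
introduced in Problem 5.3, see also below Eq. (5.20). For $a > 0$ and with $a x^2 = u$
and $du = 2ax\,dx$, we obtain:

$$\int_0^{\infty} x^r e^{-a x^2}\,dx = \frac{1}{2\,a^{(r+1)/2}} \int_0^{\infty} u^{(r-1)/2} e^{-u}\,du = \frac{\Gamma\left(\frac{r+1}{2}\right)}{2\,a^{(r+1)/2}}.$$

The Gamma function possesses a number of nice properties and special values,
remember in particular e.g.

$$\Gamma(1) = 1, \quad \Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}, \quad \Gamma(n+1) = n\,\Gamma(n).$$

With $a = \frac{1}{2t}$ for the moments it therefore follows that:

$$\mathrm{E}(X(t)) = \sqrt{\frac{2t}{\pi}}, \qquad \mathrm{E}(X^2(t)) = \frac{1}{\sqrt{2\pi t}}\,(2t)^{3/2}\,\Gamma\left(\tfrac{3}{2}\right) = t.$$

The variance formula is obtained by the usual variance decomposition:

$$\mathrm{Var}(X(t)) = \mathrm{E}(X^2(t)) - (\mathrm{E}(X(t)))^2.$$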
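The reduction of the moments of $|W(t)|$ to the Gamma function can be verified numerically. The sketch below uses the standard integral $\int_0^{\infty} x^r e^{-a x^2}\,dx = \Gamma((r+1)/2) / (2 a^{(r+1)/2})$ with $a = 1/(2t)$; the function name `reflected_wp_moment` and the value $t = 2.5$ are illustrative assumptions.

```python
import math


def reflected_wp_moment(r, t):
    """r-th moment of |W(t)| via the Gamma function, for r > -1 and t > 0."""
    a = 1.0 / (2.0 * t)
    # integral_0^inf x**r * exp(-a x**2) dx = Gamma((r + 1) / 2) / (2 a**((r + 1) / 2))
    integral = math.gamma((r + 1) / 2.0) / (2.0 * a ** ((r + 1) / 2.0))
    # The density of |W(t)| carries the factor 2 / sqrt(2 pi t).
    return 2.0 / math.sqrt(2.0 * math.pi * t) * integral


t = 2.5
mean = reflected_wp_moment(1, t)
second = reflected_wp_moment(2, t)
variance = second - mean ** 2  # usual variance decomposition
print(mean, math.sqrt(2.0 * t / math.pi))
print(second, t)
print(variance, t * (1.0 - 2.0 / math.pi))
```

Each printed pair agrees up to floating-point error, which matches the closed-form mean, second moment and variance of the reflected WP.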
7.8 The random variable $\sigma W(t)$ follows for fixed $t$ a Gaussian distribution with
expectation and median equal to zero. Hence, it follows that

$$\mathrm{P}(\sigma W(t) \le 0) = \mathrm{P}\left(e^{\sigma W(t)} \le 1\right) = 0.5.$$

By multiplying the inequality by $e^{\mu t}$ we obtain

$$\mathrm{P}\left(e^{\mu t} e^{\sigma W(t)} \le e^{\mu t}\right) = \mathrm{P}\left(e^{\mu t + \sigma W(t)} \le e^{\mu t}\right) = 0.5.$$

Therefore, the median of $X(t) = e^{\mu t + \sigma W(t)}$ is determined independently of $\sigma$ as
$e^{\mu t}$, as claimed.

7.9 As $X(t) = \max_{0 \le s \le t} W(s)$ is a continuous random variable for given $t$, it holds
that

$$F_x(b) := \mathrm{P}(X(t) \le b) = \mathrm{P}\left(\max_{0 \le s \le t} W(s) < b\right).$$

Remember the random variable $T_b$ from Proposition 7.1 specifying the point in time
at which $W(t)$ hits the value $b$ for the first time. The event $\max_{0 \le s \le t} W(s) < b$ is
equivalent to the fact that the hitting time of $b$ is larger than $t$. Therefore, when using
the distribution function from Proposition 7.1, it holds that

$$\mathrm{P}(X(t) \le b) = 1 - \mathrm{P}(T_b \le t) = 1 - \frac{2}{\sqrt{2\pi}} \int_{b/\sqrt{t}}^{\infty} e^{-y^2/2}\,dy.$$

Naturally, the number 1 can be written as an integral over the density of the standard
normal distribution:

$$1 = \frac{2}{\sqrt{2\pi}} \int_0^{\infty} e^{-y^2/2}\,dy, \quad \text{such that} \quad \mathrm{P}(X(t) \le b) = \frac{2}{\sqrt{2\pi}} \int_0^{b/\sqrt{t}} e^{-y^2/2}\,dy.$$

By substitution, $z = y\sqrt{t}$ and $dz = \sqrt{t}\,dy$, and due to (7.7) the desired result is
immediately obtained:

$$\mathrm{P}(X(t) \le b) = \frac{2}{\sqrt{2\pi t}} \int_0^{b} \exp\left(-\frac{z^2}{2t}\right) dz = \mathrm{P}(|W(t)| \le b).$$


8 Riemann Integrals


8.1  Summary 

In  this  chapter  we  deal  with  stochastic  Riemann  integrals,  i.e.  with  ordinary 
Riemann integrals with a stochastic process as the integrand. Mathematically,
these constructs are relatively unsophisticated: they can be defined pathwise for
continuous functions as in conventional (deterministic) calculus. However, this
pathwise  definition  will  not  be  possible  any  longer  for  e.g.  Ito  integrals  in  the  chapter 
after  next.  Hence,  at  this  point  we  propose  a  way  of  defining  integrals  as  a  limit  (in 
mean  square)  which  will  be  useful  later  on.  If  the  stochastic  integrand  is  in  particular 
a  Wiener  process,  then  the  Riemann  integral  follows  a  Gaussian  distribution  with 
zero  expectation  and  the  familiar  formula  for  the  variance.  A  number  of  examples 
will  facilitate  the  understanding  of  this  chapter. 


8.2  Definition  and  Fubini's  Theorem 

As  one  has  done  in  deterministic  calculus,  we  will  define  the  Riemann  integral  by 
an  adequate  partition  as  the  limit  of  a  sum. 


Partition 

In order to define an integral of a function from 0 to $t$, we decompose the interval
into $n$ adjacent, non-overlapping subintervals which are allowed to intersect at the
endpoints:

$$P_n([0, t]): \quad 0 = s_0 < s_1 < \ldots < s_n = t. \tag{8.1}$$

Bernhard Riemann (1826-1866) studied with Gauss in Göttingen where he himself became
a professor. Already before his day, integration had been used as a technique which reverses
differentiation by forming an antiderivative. However, Riemann explained for the first time under
which conditions a function possesses an antiderivative at all.

In the following, we always assume that the partition $P_n([0, t])$ becomes increasingly
fine with $n$ growing (“adequate partition”):

$$\max_{1 \le i \le n} (s_i - s_{i-1}) \to 0 \quad \text{for } n \to \infty. \tag{8.2}$$

By $s_i^*$ we denote an arbitrary point in the $i$-th interval,

$$s_i^* \in [s_{i-1}, s_i], \quad i = 1, \ldots, n.$$

Occasionally, we will sum up the lengths of the subintervals. Obviously, it holds
that

$$\sum_{i=1}^{n} (s_i - s_{i-1}) = s_n - s_0 = t.$$

In general, for a function $\varphi$ one obtains:

$$\sum_{i=1}^{n} \left(\varphi(s_i) - \varphi(s_{i-1})\right) = \varphi(t) - \varphi(0). \tag{8.3}$$

Sometimes, we will operate with the example of the equidistant partition. It is given
by $s_i = i\,t/n$:

$$0 = s_0 < s_1 = \frac{t}{n} < \ldots < s_{n-1} = \frac{(n-1)\,t}{n} < s_n = t.$$

Due to $s_i - s_{i-1} = t/n$ the required refinement from (8.2) for $n \to \infty$ is guaranteed.
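The partition and the telescoping property (8.3) translate directly into a few lines of code. This is a minimal sketch; the helper names `equidistant_partition`, `mesh` and `telescoping_sum` are illustrative and not taken from the text.

```python
def equidistant_partition(t, n):
    """Return the points s_0, ..., s_n with s_i = i * t / n."""
    return [i * t / n for i in range(n + 1)]


def mesh(points):
    """Largest subinterval length, i.e. the quantity that must vanish in (8.2)."""
    return max(hi - lo for lo, hi in zip(points, points[1:]))


def telescoping_sum(phi, points):
    """Sum of phi(s_i) - phi(s_{i-1}); by (8.3) this equals phi(t) - phi(0)."""
    return sum(phi(hi) - phi(lo) for lo, hi in zip(points, points[1:]))


points = equidistant_partition(2.0, 8)
print(mesh(points))                               # t / n = 0.25
print(telescoping_sum(lambda s: s ** 3, points))  # phi(2) - phi(0) = 8
```

For the equidistant partition the mesh equals $t/n$, so it vanishes as $n$ grows, as required by (8.2).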

Definition  and  Existence 

Now, the product of a deterministic function $f$ and a stochastic process $X$ is to
be integrated. To this end, the Riemann sum is defined by means of the notation
introduced:

$$R_n = \sum_{i=1}^{n} f(s_i^*)\,X(s_i^*)\,(s_i - s_{i-1}). \tag{8.4}$$

Here, we have a sum of rectangular areas, each with a width of $(s_i - s_{i-1})$ and the
height $f(s_i^*)\,X(s_i^*)$. With $n$ growing, the area beneath $f(s) X(s)$ on $[0, t]$ is to be
approximated all the better. If the limit of this sum for $n \to \infty$ exists uniquely
and independently of the partition and of the choice of $s_i^*$, then it is defined as a
(stochastic) Riemann integral. In this case, the convergence occurs in mean square
(“$\overset{2}{\to}$”):

$$R_n = \sum_{i=1}^{n} f(s_i^*)\,X(s_i^*)\,(s_i - s_{i-1}) \overset{2}{\to} \int_0^t f(s)\,X(s)\,ds.$$

One  then  says  that  the  Riemann  integral  exists.  For  this  existence,  there  is  a 
sufficient  and  a  necessary  condition  formulated  in  the  following  proposition.  The 
proof  is  carried  out  with  part  (b)  from  Lemma  8.2  below,  see  Problem  8.1.  Further 
elaborations  on  mean  square  convergence  can  be  found  at  the  end  of  the  chapter. 
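The Riemann sum (8.4) can be sketched in code for a single path. The example below is an illustration under assumptions: the path is a discretised Wiener path on a fine grid, the evaluation point $s_i^*$ is the midpoint of each subinterval, and all names (`riemann_sum`, `simulate_wp_path`, `path_value`) are hypothetical.

```python
import math
import random


def riemann_sum(f, x, points, pick=lambda lo, hi: 0.5 * (lo + hi)):
    """R_n from (8.4): sum of f(s*) X(s*) (s_i - s_{i-1}) with s* = pick(s_{i-1}, s_i)."""
    total = 0.0
    for lo, hi in zip(points, points[1:]):
        s_star = pick(lo, hi)
        total += f(s_star) * x(s_star) * (hi - lo)
    return total


def simulate_wp_path(t, n_steps, rng):
    """Values W(0), W(dt), ..., W(t) of one discretised Wiener path."""
    sd = math.sqrt(t / n_steps)
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, sd))
    return path


rng = random.Random(1)
t, n_fine = 1.0, 4096
path = simulate_wp_path(t, n_fine, rng)


def path_value(s):
    """Look up the simulated path at the fine-grid point closest to s."""
    return path[round(s / t * n_fine)]


# Riemann sums of one and the same path for increasingly fine partitions.
for n in (8, 64, 512):
    points = [i * t / n for i in range(n + 1)]
    print(n, riemann_sum(lambda s: 1.0, path_value, points))
```

For a fixed path the printed values settle down as $n$ grows, which is the pathwise counterpart of the mean square convergence described above.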

Proposition 8.1 (Existence of the Riemann Integral) The Riemann sum from
Eq. (8.4) converges in mean square for $n \to \infty$ under (8.2) if and only if the double
integral

$$\int_0^t \int_0^t f(s)\,f(r)\,\mathrm{E}(X(s)\,X(r))\,dr\,ds$$

exists.

A sufficient condition for the existence of the Riemann integral is that the
function $f$ is continuous and that furthermore $\mathrm{E}(X(s) X(r))$ is continuous in both
arguments. In order to find this, we define

$$\varphi(s) := f(s) \int_0^t f(r)\,\mathrm{E}(X(s) X(r))\,dr.$$

Now, if the function $\mathrm{E}(X(s) X(r))$ is continuous in both arguments, this implies
continuity of $\varphi$ for a continuous $f$ as the integral is a continuous functional, see e.g.
Trench (2013, p. 462). Therefore, the ordinary Riemann integral of $\varphi$ exists,

$$\int_0^t \varphi(s)\,ds = \int_0^t \int_0^t f(s)\,f(r)\,\mathrm{E}(X(s)\,X(r))\,dr\,ds.$$

Hence, the Riemann sum from (8.4) converges due to Proposition 8.1.


2  A  definition  and  discussion  of  this  mode  of  convergence  can  be  found  in  the  fourth  section  of  this 
chapter. 
Fubini's Theorem

Frequently,  we  are  interested  in  the  average  behavior,  i.e.  the  expected  value  of 
Riemann  integrals.  The  expected  value,  however,  is  defined  as  an  integral  itself  such 
that  one  is  confronted  with  double  integrals.  For  calculating  these,  there  is  a  simple 
rule  which  is  finally  based  on  the  fact  that  the  order  of  integration  does  not  matter  for 
double  integrals  over  continuous  functions.  In  deterministic  calculus,  this  fact  is  also 
known  as  “Fubini’s  theorem”  (see  Footnote  6  in  Sect.  2.3).  Adapted  to  our  problem 
of  the  expected  value  of  a  Riemann  integral,  the  corresponding  circumstances  are 
given  in  the  following  proposition,  also  cf.  Billingsley  (1986,  Theorem  18.3). 

Proposition 8.2 (Fubini’s Theorem) If $\int_0^t \mathrm{E}(|X(s)|)\,ds$ exists, it holds for a continuous
$X$ that:

$$\mathrm{E}\left[\int_0^t X(s)\,ds\right] = \int_0^t \mathrm{E}(X(s))\,ds.$$

The statement is easy to comprehend if one thinks of the integral as a finite
Riemann sum. As is well known, in the discrete case, summation and expectation are
interchangeable:

$$\mathrm{E}\left(\sum_{i=1}^{n} X(s_i^*)\,(s_i - s_{i-1})\right) = \sum_{i=1}^{n} \mathrm{E}(X(s_i^*))\,(s_i - s_{i-1}).$$

Now, Fubini’s theorem just guarantees a continuation of this interchangeability for
$n \to \infty$.

Example 8.1 (Expected Value of the Integrated WP) Consider the special case of
the integrated WP with $X(s) = W(s)$ and $f(s) = 1$. With the WP being continuous,
$|W(t)|$ is a continuous process as well. In (7.8) we have determined the following
expression as the expected value:

$$\mathrm{E}(|W(t)|) = \sqrt{\frac{2t}{\pi}}.$$

Before applying Proposition 8.2, we check:

$$\int_0^t \mathrm{E}(|W(s)|)\,ds = \sqrt{\frac{2}{\pi}} \int_0^t \sqrt{s}\,ds = \sqrt{\frac{2}{\pi}}\;\frac{2}{3}\,t^{3/2}.$$

As this quantity is finite, the requirements of Fubini’s theorem are fulfilled. Hence,
it follows that

$$\mathrm{E}\left[\int_0^t W(s)\,ds\right] = \int_0^t \mathrm{E}(W(s))\,ds = 0. \qquad \blacksquare$$

General  Rules 


Note that our definition of the integral seems to be unnecessarily restrictive.
However, the restriction on the interval $[0, t]$ is by no means crucial. The usual rules
apply and are here symbolically described for an integrand $g$ (without proof):

$$\int_a^b g(x)\,dx = \int_a^c g(x)\,dx + \int_c^b g(x)\,dx \quad \text{for } a \le c \le b,$$

$$\int \left(g_1(x) + g_2(x)\right) dx = \int g_1(x)\,dx + \int g_2(x)\,dx,$$

$$\int c\,g(x)\,dx = c \int g(x)\,dx \quad \text{for } c \in \mathbb{R}.$$


8.3  Riemann  Integration  of  Wiener  Processes 

In this section, we concentrate on Riemann integrals where the stochastic part of the
integrand is a WP: $X(t) = W(t)$.


Normal  Distribution 


Frequently,  Gaussian  random  variables  are  hidden  behind  Riemann  integrals.  In 
fact,  it  holds  that  all  of  the  integrals  discussed  in  this  section  follow  Gaussian 
distributions  with  zero  expectation.  The  variances  can  be  determined  according  to 
the  following  proposition  (for  a  proof  see  Problem  8.3). 


Proposition 8.3 (Normality of Riemann Integrals) Let $f$ be a continuous deterministic
function on $[0, t]$. Then, it holds

$$\int_0^t f(s)\,W(s)\,ds \sim \mathcal{N}\left(0, \int_0^t \int_0^t f(r)\,f(s)\,\min(r, s)\,dr\,ds\right).$$


The normality follows from the fact that the WP is a Gaussian process. Hence,
the Riemann sum $R_n$ from (8.4) follows a Gaussian distribution for finite $n$. As
$R_n$ converges in mean square, it follows from Lemma 8.1 (see below) that the
limit is Gaussian as well. Note that the finiteness of the variance expression from
Proposition 8.3 is just sufficient and necessary for the existence of the Riemann
integral (Proposition 8.1).


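Example 8.2 (Variance of the Integrated WP) Consider an integrated WP with
$f(s) = 1$ as in Example 8.1. We look for a closed expression for the variance of
$\int_0^t W(s)\,ds$. Due to Proposition 8.3, the starting point is:

$$\mathrm{Var}\left(\int_0^t W(s)\,ds\right) = \int_0^t \int_0^t \min(r, s)\,dr\,ds.$$

Now, we employ a useful trick for many applications. The integral with respect to $r$
is decomposed into the sum of two integrals with $s$ as the integration limit such that
the minimum function can be specified explicitly:

$$\begin{aligned}
\int_0^t \int_0^t \min(r, s)\,dr\,ds &= \int_0^t \left[\int_0^s \min(r, s)\,dr + \int_s^t \min(r, s)\,dr\right] ds \\
&= \int_0^t \left[\int_0^s r\,dr + \int_s^t s\,dr\right] ds.
\end{aligned}$$

Now, the integration of the power functions yields the requested variance:

$$\begin{aligned}
\mathrm{Var}\left(\int_0^t W(s)\,ds\right) &= \int_0^t \left[\int_0^s r\,dr + \int_s^t s\,dr\right] ds \\
&= \int_0^t \left[\frac{s^2}{2} + s(t - s)\right] ds \\
&= \left[\frac{s^2 t}{2} - \frac{s^3}{6}\right]_0^t = \frac{t^3}{3}. \qquad \blacksquare
\end{aligned}$$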
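The variance $t^3/3$ of the integrated WP can be confirmed by evaluating the double integral of $\min(r, s)$ numerically. The following is a simple midpoint-rule sketch; the function name `integrated_wp_variance` and the grid size are illustrative assumptions.

```python
def integrated_wp_variance(t, n_grid=400):
    """Midpoint-rule approximation of the double integral of min(r, s) over [0, t]^2."""
    h = t / n_grid
    mids = [(k + 0.5) * h for k in range(n_grid)]
    total = 0.0
    for s in mids:
        for r in mids:
            total += min(r, s)
    return total * h * h


for t in (1.0, 2.0):
    print(t, integrated_wp_variance(t), t ** 3 / 3.0)
```

The numerical values agree with $t^3/3$ up to a small discretisation error that stems from the kink of $\min(r, s)$ along the diagonal $r = s$.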


Autocovariance  Function 


With the time-dependent integration limit, $\int_0^t f(s)\,W(s)\,ds$ itself is a stochastic
process. Therefore, it suggests itself to not only determine the variance as in
Proposition 8.3, but the covariance function as well. The general result is given
in the following proposition, which will be verified in Problem 8.7.

Proposition 8.4 (Autocovariance Function) For a continuous function $f$ with
integrable antiderivative $F$ and $Y(t) = \int_0^t f(s)\,W(s)\,ds$ it holds that:

$$\mathrm{E}(Y(t)\,Y(t+h)) = \int_0^t f(s) \left[s\,F(s) - \int_0^s F(r)\,dr + s\left(F(t+h) - F(s)\right)\right] ds,$$

where $h \ge 0$.

Therefore, with $h = 0$ an alternative expression for the variance from Proposition 8.3
is obtained. For concrete functions $f$, the formula can be simplified
considerably. This is to be shown by the following example.


Example 8.3 (Autocovariance of the Integrated WP) Once again, we examine the
integrated WP with $f(s) = 1$ and $F(s) = s$ as in Examples 8.1 and 8.2. Then,
plugging in yields:

$$\begin{aligned}
\mathrm{E}\left(\int_0^t W(s)\,ds \int_0^{t+h} W(r)\,dr\right) &= \int_0^t \left[s^2 - \frac{1}{2}s^2 + s\left((t+h) - s\right)\right] ds \\
&= \int_0^t \left[s(t+h) - \frac{1}{2}s^2\right] ds \\
&= \frac{t^2 (t+h)}{2} - \frac{t^3}{6}.
\end{aligned}$$

Hence, for $h = 0$ the variance of the integrated Wiener process reads:

$$\mathrm{Var}\left(\int_0^t W(s)\,ds\right) = \frac{t^3}{3}.$$

Of course, we already know this from Example 8.2. ■
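The closed form of the autocovariance can be compared with a direct numerical evaluation of $\int_0^t \int_0^{t+h} \min(r, s)\,dr\,ds$, which is what the expectation reduces to for $f = 1$ after interchanging expectation and integration. The sketch below uses a midpoint rule; the function names and grid size are illustrative assumptions.

```python
def autocov_direct(t, h, n_grid=200):
    """Midpoint rule for the integral of min(r, s) over s in [0, t], r in [0, t + h]."""
    ds = t / n_grid
    dr = (t + h) / n_grid
    total = 0.0
    for i in range(n_grid):
        s = (i + 0.5) * ds
        for j in range(n_grid):
            r = (j + 0.5) * dr
            total += min(r, s)
    return total * ds * dr


def autocov_closed_form(t, h):
    """Closed form for f = 1: t^2 (t + h) / 2 - t^3 / 6."""
    return t * t * (t + h) / 2.0 - t ** 3 / 6.0


for h in (0.0, 0.5, 2.0):
    print(h, autocov_direct(1.0, h), autocov_closed_form(1.0, h))
```

For $h = 0$ both columns reduce to $t^3/3$, the variance of the integrated WP.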


Examples 

For three special Gaussian integrals, which we will be confronted with over and
over, the variances are to be calculated. We put the results in front.

Corollary 8.1 It holds that

(a) $\int_0^1 W(s)\,ds \sim \mathcal{N}(0, 1/3)$,

(b) $W(1) - \int_0^1 W(s)\,ds \sim \mathcal{N}(0, 1/3)$,

(c) $\int_0^1 (s - c)\,W(s)\,ds \sim \mathcal{N}(0, \sigma_c^2)$,

where $c \in \mathbb{R}$ and $\sigma_c^2 = \frac{8 - 25c + 20c^2}{60} > 0$.


The  normality  in  (a)  and  (c)  is  clear  due  to  Proposition  8.3.  In  (b)  we  have  the  sum 
of  two  Gaussian  random  variables  which  does  not  necessarily  have  to  be  Gaussian 
again  unless  a  multivariate  Gaussian  distribution  is  present.  Thus,  the  normality  of 
(b)  can  only  be  proven  in  connection  with  Stieltjes  integrals  (see  Problem  9.2). 

The  result  from  (a)  is  a  special  case  of  Example  8.2  with  t  —  1.  We  show  in 
Problem  8.4  that  the  variance  in  (b)  is  just  1/3.  The  proof  of  (c)  for  c  —  0  is 
given  in  Problem  8.5;  for  an  arbitrary  c,  the  proof  is  basically  similar,  however,  it 
gets  computationally  more  involved.  Note  that  the  variance  cannot  be  zero  or 
negative  for  any  c  (Problem  8.6). 
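The variance in part (c) can be cross-checked against the double integral from Proposition 8.3 with $f(s) = s - c$. The sketch below is a numerical illustration only; the function names are hypothetical, and the positivity remark rests on the observation that the quadratic $20c^2 - 25c + 8$ has the negative discriminant $25^2 - 4 \cdot 20 \cdot 8 = -15$.

```python
def variance_c_numeric(c, n_grid=300):
    """Midpoint rule for the double integral of (r - c)(s - c) min(r, s) over [0, 1]^2."""
    h = 1.0 / n_grid
    mids = [(k + 0.5) * h for k in range(n_grid)]
    total = 0.0
    for s in mids:
        for r in mids:
            total += (r - c) * (s - c) * min(r, s)
    return total * h * h


def variance_c_closed_form(c):
    """sigma_c^2 = (8 - 25 c + 20 c^2) / 60 from Corollary 8.1(c)."""
    return (8.0 - 25.0 * c + 20.0 * c * c) / 60.0


# c = 0.625 minimises the quadratic; the variance stays positive even there.
for c in (0.0, 0.5, 0.625, 2.0):
    print(c, variance_c_numeric(c), variance_c_closed_form(c))
```

For $c = 0$ the closed form gives $8/60 = 2/15$, the special case treated in Problem 8.5.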

Again, there should be a word of warning concerning equality in distribution.
From (b) it follows that:

$$\int_0^1 W(s)\,ds - W(1) \sim \mathcal{N}(0, 1/3).$$

Therefore, the following random variables are equal in distribution,

$$\int_0^1 W(s)\,ds - W(1) \sim \int_0^1 W(s)\,ds,$$

although, pathwise it obviously holds that:

$$\int_0^1 W(s)\,ds - W(1) \ne \int_0^1 W(s)\,ds.$$


8.4  Convergence  in  Mean  Square 

Now, we supply some basics which are not necessary for the understanding of
Riemann integrals; however, they are helpful for some technical properties. In
particular, for the elaboration on the Ito integral following below, the knowledge
of convergence in mean square is advantageous for a complete understanding. For
a brief introduction to the basics of asymptotic theory, Pötscher and Prucha (2001)
can be recommended.


Definition  and  Properties 

Let $\{X_n\}$, $n \in \mathbb{N}$, be a sequence of real random variables with

$$\mathrm{E}(X_n^2) < \infty. \tag{8.5}$$

For a sequence $\{X_n\}$ and a random variable $X$, we define the mean squared error
as distance or norm:

$$\mathrm{MSE}(X_n, X) := \mathrm{E}\left[(X_n - X)^2\right].$$

One says, $\{X_n\}$ converges in mean square to $X$ for $n$ tending to infinity if

$$\mathrm{MSE}(X_n, X) \to 0, \quad n \to \infty.$$

Abbreviating, we write for this as well

$$X_n \overset{2}{\to} X.$$

This limit is unique with probability one. Of course, it can be a random variable itself
or a constant. In any case, due to (8.5) it holds that $\mathrm{E}(X^2) < \infty$. In fact, expected
value and variance of $X$ can be determined from the moments of $X_n$. In particular, the
limit of Gaussian random variables is again Gaussian. More precisely, the following
lemma holds (see Problem 8.8 for a proof).

Lemma 8.1 (Properties of the Limit in Mean Square) Let $\{X_n\}$ with (8.5)
converge in mean square to $X$. Then it holds for $n \to \infty$:

(a) $\mathrm{E}(X_n) \to \mathrm{E}(X)$;

(b) $\mathrm{E}(X_n^2) \to \mathrm{E}(X^2)$;

(c) if $\{X_n\}$ is Gaussian, then $X$ follows a Gaussian distribution as well.

Naturally, the parameters of the Gaussian distribution of $X$ from (c) follow according
to (a) and (b).


Convergence  to  a  Constant 

If the limit is a constant, then it is particularly easy to establish convergence in mean
square. For this purpose, we consider the following derivation. By zero addition and
the binomial formula, the following expression is obtained:

$$\begin{aligned}
[X_n - X]^2 &= \left[(X_n - \mathrm{E}(X_n)) - (X - \mathrm{E}(X_n))\right]^2 \\
&= (X_n - \mathrm{E}(X_n))^2 - 2 (X_n - \mathrm{E}(X_n))(X - \mathrm{E}(X_n)) + (X - \mathrm{E}(X_n))^2.
\end{aligned}$$

The expectation operator yields:

$$\mathrm{MSE}(X_n, X) = \mathrm{Var}(X_n) - 2\,\mathrm{E}\left[(X_n - \mathrm{E}(X_n))(X - \mathrm{E}(X_n))\right] + \mathrm{E}\left[(X - \mathrm{E}(X_n))^2\right].$$

If $X$ is a constant (a “degenerate random variable”), $X = c$, then the second term
becomes zero and the third term is the expected value of a constant. In other words,
this yields:

$$\mathrm{MSE}(X_n, c) = \mathrm{Var}(X_n) + \left[c - \mathrm{E}(X_n)\right]^2.$$

Hence, $\{X_n\}$ converges in mean square to a constant $c$ if and only if it holds that

$$\mathrm{Var}(X_n) \to 0 \quad \text{and} \quad \mathrm{E}(X_n) \to c, \quad n \to \infty.$$

As  is  well  known,  this  implies  that  {Xn}  converges  to  c  in  probability  as  well  (see 
Lemma  8.3  below).  Next,  we  cover  criteria  of  convergence. 
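The decomposition of the mean squared error into variance plus squared bias can be checked exactly for a simple discrete example. In the sketch below, $X_n$ takes the two values $c + 1/n \pm 1/\sqrt{n}$ with probability one half each; this example and the function names are illustrative assumptions, not taken from the text.

```python
import math


def mse_to_constant(values, probs, c):
    """E[(X - c)^2] for a discrete random variable X with the given distribution."""
    return sum(p * (v - c) ** 2 for v, p in zip(values, probs))


def variance_plus_squared_bias(values, probs, c):
    """Var(X) + (c - E(X))^2, the right-hand side of the decomposition."""
    mean = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    return var + (c - mean) ** 2


c = 3.0
for n in (1, 10, 100, 10_000):
    # X_n has mean c + 1/n and variance 1/n, so MSE(X_n, c) = 1/n + 1/n**2.
    values = [c + 1.0 / n - 1.0 / math.sqrt(n), c + 1.0 / n + 1.0 / math.sqrt(n)]
    probs = [0.5, 0.5]
    print(n, mse_to_constant(values, probs, c), variance_plus_squared_bias(values, probs, c))
```

Both columns coincide and tend to zero, since the variance and the squared bias vanish simultaneously.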


Test  of  Convergence 

Now, we still need a convenient criterion in order to decide whether a sequence is
convergent in mean square. In fact, we have two equivalent criteria. For the proof
see Problem 8.9. The name goes back to the famous French mathematician Augustin
Louis Cauchy (1789-1857).

Lemma 8.2 (Cauchy Criterion) A sequence $\{X_n\}$ with (8.5) converges in mean
square ...

(a) ... if and only if it holds for arbitrary $n$ and $m$ that

$$\mathrm{E}\left[(X_m - X_n)^2\right] \to 0, \quad m, n \to \infty;$$

(b) ... or put equivalently, if and only if it holds for arbitrary $n$ and $m$ that

$$\mathrm{E}(X_m X_n) \to c < \infty, \quad m, n \to \infty,$$

where $c \in \mathbb{R}$ is a constant.

Note that the convergence of the Cauchy criterion holds independently of how $m$
and $n$ tend to infinity. As well, the constant $c$ results independently of the choice of
$m$ and $n$. As the criteria are sufficient and necessary, the proof of existence for mean
square convergence can be supplied without determining the limit explicitly.


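Example 8.4 (Law of Large Numbers) Let $\{\varepsilon_t\}$ be a white noise process,
$\varepsilon_t \sim WN(0, \sigma^2)$. Then, it can be shown that the arithmetic mean,

$$X_n := \bar{\varepsilon}_n = \frac{1}{n} \sum_{t=1}^{n} \varepsilon_t,$$

converges in mean square without specifying the limit. It namely holds that

$$\mathrm{E}(\bar{\varepsilon}_n \bar{\varepsilon}_m) = \frac{1}{mn}\,\mathrm{E}\left[\sum_{t=1}^{n} \varepsilon_t \sum_{t=1}^{m} \varepsilon_t\right] = \frac{1}{mn} \sum_{t=1}^{\min(n, m)} \mathrm{E}(\varepsilon_t^2) = \sigma^2\,\frac{\min(n, m)}{mn} \to 0$$

for $m, n \to \infty$. Due to Lemma 8.2(b) we conclude that $\bar{\varepsilon}_n$ has a limit in mean
square.

Let the limit of $\bar{\varepsilon}_n$ simply be called $\varepsilon$. Naturally, it can be determined immediately.
Due to

$$\mathrm{E}(\bar{\varepsilon}_n) = 0 \quad \text{and} \quad \mathrm{Var}(\bar{\varepsilon}_n) = \frac{\sigma^2}{n}$$

it follows from Lemma 8.1(a) and (b) for the limit that

$$\mathrm{E}(\varepsilon) = 0 \quad \text{and} \quad \mathrm{Var}(\varepsilon) = 0.$$

Hence, the limit is equal to zero (with probability one). From this, it follows for
$x_t = \mu + \varepsilon_t$ that the arithmetic mean of $x_t$ converges in mean square to the true
expected value, $\mu$. In the literature, this fact is also known as the “law of large
numbers”. ■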
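The cross moment of two sample means can be computed from the double sum of $\mathrm{E}(\varepsilon_t \varepsilon_s)$, which for white noise equals $\sigma^2$ if $t = s$ and zero otherwise. The sketch below evaluates this double sum explicitly and compares it with $\sigma^2 \min(n, m)/(mn)$; the function name `cross_moment_exact` is an illustrative assumption.

```python
def cross_moment_exact(n, m, sigma2=1.0):
    """E(mean_n * mean_m) from the double sum of E(eps_t eps_s) = sigma2 * 1{t = s}."""
    total = 0.0
    for t in range(1, n + 1):
        for s in range(1, m + 1):
            if t == s:  # white noise: only equal time indices contribute
                total += sigma2
    return total / (n * m)


# The cross moment tends to zero however n and m grow.
for n, m in ((5, 5), (20, 50), (200, 100), (1000, 1000)):
    print(n, m, cross_moment_exact(n, m, sigma2=2.0), 2.0 * min(n, m) / (n * m))
```

With $n = m$ the expression reduces to $\sigma^2/n$, the variance of the sample mean.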


Further  Modes  of  Convergence 

Two weaker concepts of convergence can be defined via probability statements.
First, we say $\{X_n\}$ converges in probability to $X$ if it holds for arbitrary $\varepsilon > 0$ that:

$$\lim_{n \to \infty} \mathrm{P}(|X_n - X| \ge \varepsilon) = 0.$$

Symbolically, this is denoted as

$$X_n \overset{p}{\to} X \quad \text{for } n \to \infty.$$

Secondly, one talks about convergence in distribution of $\{X_n\}$ to $X$ if it holds for
all points $x \in \mathbb{R}$ at which the distribution function $F(x)$ of $X$ is continuous, that
the distribution function $F_n(x)$ of $X_n$ tends to $F(x)$:

$$\lim_{n \to \infty} F_n(x) = \lim_{n \to \infty} \mathrm{P}(X_n \le x) = \mathrm{P}(X \le x) = F(x).$$

The word “distribution” suggests the symbolic notation:

$$X_n \overset{d}{\to} X \quad \text{for } n \to \infty.$$

From Grimmett and Stirzaker (2001, p. 310) or Pötscher and Prucha (2001, Theorem
5 and 9) we adopt the following results.

Lemma 8.3 (Implications of Convergence) The following implications hold, $n \to \infty$.

(a) Convergence in mean square implies convergence in probability:

$$(X_n \overset{2}{\to} X) \implies (X_n \overset{p}{\to} X).$$

(b) Convergence in probability implies convergence in distribution:

$$(X_n \overset{p}{\to} X) \implies (X_n \overset{d}{\to} X).$$

In general, the converse of Lemma 8.3(a) or (b) does not hold. However, if $X = c$
is a constant, then convergence in probability and convergence in distribution are
equivalent, see Grimmett and Stirzaker (2001, p. 310) or Pötscher and Prucha (2001,
Theorem 10).


8.5  Problems  and  Solutions 

Problems 

8.1 Prove Proposition 8.1.

Hint: Use Lemma 8.2.


3 In particular in econometrics, one often writes alternatively $\operatorname{plim} X_n = X$ as $n \to \infty$.

8.2 Determine the expected value from Proposition 8.3.

8.3 Determine the variance from Proposition 8.3.

8.4 Calculate the variance from Corollary 8.1(b).

8.5 Calculate the variance from Corollary 8.1(c) for the special case of $c = 0$.

8.6 Show that the variance from Corollary 8.1(c) is positive.

8.7 Prove Proposition 8.4.

8.8 Prove Lemma 8.1.

8.9 Prove Lemma 8.2.

Solutions 


8.1 Analogously to the partition (8.1) and the Riemann sum $R_n$ from (8.4), we define
for arbitrary $m$ with $m \to \infty$:

$$P_m([0, t]): \quad 0 = r_0 < \ldots < r_m = t, \quad \max_{1 \le j \le m} (r_j - r_{j-1}) \to 0,$$

$$R_m = \sum_{j=1}^{m} f(r_j^*)\,X(r_j^*)\,(r_j - r_{j-1}), \quad r_j^* \in [r_{j-1}, r_j].$$

In order to apply the existence criterion from Lemma 8.2(b), we formulate the
product of the two Riemann sums as follows:

$$R_n R_m = \sum_{i=1}^{n} \sum_{j=1}^{m} f(s_i^*)\,f(r_j^*)\,X(s_i^*)\,X(r_j^*)\,(r_j - r_{j-1})(s_i - s_{i-1}).$$

Hence, the Riemann integral as limit of $R_n$ exists if and only if $\mathrm{E}(R_n R_m)$ converges.
Further,

$$\mathrm{E}(R_n R_m) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(s_i^*)\,f(r_j^*)\,\mathrm{E}\left(X(s_i^*)\,X(r_j^*)\right)(r_j - r_{j-1})(s_i - s_{i-1})$$

converges if and only if the ordinary Riemann double integral

$$\int_0^t \int_0^t f(s)\,f(r)\,\mathrm{E}(X(s)\,X(r))\,dr\,ds < \infty$$

exists. The Cauchy criterion from Lemma 8.2 therefore amounts to a proof of
Proposition 8.1.

8.2 For the WP it holds that

$$\mathrm{E}(W(t)) = 0 \quad \text{and} \quad \mathrm{E}(W(s)\,W(r)) = \min(s, r).$$

The minimum function is continuous in both the arguments. If $f$ is a continuous
function as well, then we know, with the considerations following Proposition 8.1,
that the stochastic Riemann integral

$$\int_0^t f(s)\,W(s)\,ds$$

exists. In order to calculate the expected value, Fubini’s theorem will be applied.
For this purpose, we check that

$$\int_0^t \mathrm{E}(|f(s)\,W(s)|)\,ds = \int_0^t |f(s)|\,\mathrm{E}(|W(s)|)\,ds \le \max_{0 \le s \le t} |f(s)| \int_0^t \mathrm{E}(|W(s)|)\,ds$$

is finite. The bound is based on the continuity and hence the finiteness of $f$. The
integral

$$\int_0^t \mathrm{E}(|W(s)|)\,ds$$

was determined in Example 8.1 for $t$ fixed to be finite. As the WP is continuous,
Proposition 8.2 can be applied. According to this, it holds that:

$$\mathrm{E}\left(\int_0^t f(s)\,W(s)\,ds\right) = \int_0^t f(s)\,\mathrm{E}(W(s))\,ds = 0.$$


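8.3  Let us denote the Riemann integral by $Y(t)$:

$$Y(t) = \int_0^t f(s)\,W(s)\,ds.$$

We have already shown that $\mathrm{E}(Y(t)) = 0$. Hence, it follows that

$$\begin{aligned}
\mathrm{Var}(Y(t)) = \mathrm{E}\big[Y^2(t)\big]
&= \mathrm{E}\left[\int_0^t f(s)W(s)\,ds \int_0^t f(r)W(r)\,dr\right] \\
&= \mathrm{E}\left[\int_0^t \left(\int_0^t f(r)W(r)\,dr\right) f(s)W(s)\,ds\right] \\
&= \mathrm{E}\left[\int_0^t \left(\int_0^t f(r)f(s)\,W(r)W(s)\,dr\right) ds\right].
\end{aligned}$$

By applying Fubini's theorem twice, we obtain:

$$\begin{aligned}
\mathrm{E}\big[Y^2(t)\big]
&= \int_0^t \mathrm{E}\left(\int_0^t f(r)f(s)\,W(r)W(s)\,dr\right) ds \\
&= \int_0^t\!\!\int_0^t \mathrm{E}\big(f(r)f(s)\,W(r)W(s)\big)\,dr\,ds \\
&= \int_0^t\!\!\int_0^t f(r)f(s)\,\mathrm{E}\big(W(r)W(s)\big)\,dr\,ds \\
&= \int_0^t\!\!\int_0^t f(r)f(s)\,\min(r,s)\,dr\,ds,
\end{aligned}$$

which is the requested result.
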
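8.4  Let us define

$$Y(t) = W(1) - \int_0^1 W(s)\,ds$$

with $\mathrm{E}(Y(t)) = 0$. Then, it holds that

$$\begin{aligned}
\mathrm{Var}(Y(t)) = \mathrm{E}\big(Y^2(t)\big)
&= \mathrm{Var}(W(1)) + \mathrm{Var}\left(\int_0^1 W(s)\,ds\right) - 2\,\mathrm{E}\left[W(1)\int_0^1 W(s)\,ds\right] \\
&= 1 + \frac{1}{3} - 2\int_0^1 \mathrm{E}\big(W(1)W(s)\big)\,ds,
\end{aligned}$$

where the variance from Corollary 8.1(a) and Fubini's theorem were used. On $[0,1]$ it holds that:

$$\mathrm{E}\big(W(1)W(s)\big) = \min(1, s) = s.$$

Hence, the variance results as claimed:

$$\mathrm{Var}(Y(t)) = 1 + \frac{1}{3} - 2\left[\frac{s^2}{2}\right]_0^1 = 1 + \frac{1}{3} - 1 = \frac{1}{3}.$$

8.5  For $c = 0$ the claim reads

$$\mathrm{Var}\left(\int_0^1 s\,W(s)\,ds\right) = \sigma_R^2 = \frac{2}{15}.$$

According to Proposition 8.3, the variance results as a double integral for $f(s) = s$,

$$\sigma_R^2 = \int_0^1\!\!\int_0^1 r\,s\,\min(r,s)\,dr\,ds,$$

where the inner integral is appropriately decomposed to facilitate the calculation:

$$\begin{aligned}
\int_0^1\!\!\int_0^1 r\,s\,\min(r,s)\,dr\,ds
&= \int_0^1 s\left[\int_0^s r\,\min(r,s)\,dr + \int_s^1 r\,\min(r,s)\,dr\right] ds \\
&= \int_0^1 s\left[\int_0^s r^2\,dr + \int_s^1 r\,s\,dr\right] ds \\
&= \int_0^1 s\left(\frac{s^3}{3} + \frac{s}{2}\,(1 - s^2)\right) ds \\
&= \int_0^1 \left(\frac{s^4}{3} + \frac{s^2}{2} - \frac{s^4}{2}\right) ds \\
&= \int_0^1 \left(\frac{s^2}{2} - \frac{s^4}{6}\right) ds \\
&= \left[\frac{s^3}{6} - \frac{s^5}{30}\right]_0^1 = \frac{1}{6} - \frac{1}{30} = \frac{4}{30}.
\end{aligned}$$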


This  corresponds  to  the  claimed  result. 
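The double integral from Proposition 8.3 used in Solutions 8.3 to 8.5 can also be checked numerically. The following is a minimal sketch, assuming a simple midpoint rule on an equidistant grid; the function name `riemann_integral_variance` and the grid size `n` are illustrative choices, not part of the text.

```python
def riemann_integral_variance(f, t=1.0, n=400):
    """Midpoint-rule approximation of the double integral
    int_0^t int_0^t f(r) f(s) min(r, s) dr ds (the variance of
    int_0^t f(s) W(s) ds as used in Solutions 8.3 to 8.5)."""
    h = t / n
    mids = [(k + 0.5) * h for k in range(n)]  # midpoints of the grid cells
    total = 0.0
    for s in mids:
        fs = f(s)
        for r in mids:
            total += f(r) * fs * min(r, s)
    return total * h * h


# f(s) = 1 on [0, 1]: variance of int_0^1 W(s) ds, which equals 1/3
var_a = riemann_integral_variance(lambda s: 1.0)
# f(s) = s on [0, 1]: the case c = 0 of Solution 8.5, which equals 2/15
var_c0 = riemann_integral_variance(lambda s: s)
print(var_a, var_c0)
```

Both approximations should agree with 1/3 and 2/15 up to the discretization error of the midpoint rule.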


8.6  We consider the numerator of $\sigma_R^2$,

$$n(c) = 8 - 25c + 20c^2,$$

and show that it does not have any real zeros. Setting $n(c) = 0$ yields:

$$c_{1,2} = \frac{25 \pm \sqrt{25^2 - 4 \cdot 20 \cdot 8}}{2 \cdot 20} = \frac{25 \pm \sqrt{-15}}{40} = \frac{25 \pm i\sqrt{15}}{40}, \qquad i^2 = -1.$$

Thus, no real zeros exist. Consequently, since $n(0) = 8 > 0$, the function $n(c)$, which is continuous in $c$, cannot be zero or negative; the same holds true for $\sigma_R^2$, which proves the claim.

Of  course,  an  alternative  proof  consists  in  determining  the  real  extrema  of  n(c). 
There  exists  an  absolute  minimum  and  this  turns  out  to  be  positive. 
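The alternative argument via the minimum of $n(c)$ can be made concrete with exact rational arithmetic. This is only an illustrative sketch; the names `n`, `c_min` and `n_min` are chosen here and do not appear in the text.

```python
from fractions import Fraction


def n(c):
    """Numerator n(c) = 8 - 25c + 20c^2 considered in Solution 8.6."""
    return 8 - 25 * c + 20 * c * c


# Negative discriminant: the quadratic has no real zeros.
discriminant = 25 ** 2 - 4 * 20 * 8
# n is an upward-opening parabola, so its vertex is the absolute minimum.
c_min = Fraction(25, 2 * 20)
n_min = n(c_min)
print(discriminant, c_min, n_min)
```

Since the minimum value is positive, $n(c) > 0$ for every real $c$, in line with the argument above.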

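8.7  If the Riemann integral is denoted by $Y(t)$, then it holds, as for the derivation of the variance, that:

$$\begin{aligned}
\mathrm{E}\big(Y(t)\,Y(t+h)\big)
&= \int_0^t f(s)\left[\int_0^{t+h} f(r)\,\min(r,s)\,dr\right] ds \\
&= \int_0^t f(s)\left[\int_0^s f(r)\,r\,dr + s\int_s^{t+h} f(r)\,dr\right] ds.
\end{aligned}$$

Partial integration yields the following relation, where $F$ denotes an antiderivative of $f$:

$$\int_0^s f(r)\,r\,dr = F(s)\,s - \int_0^s F(r)\,dr.$$

By plugging in we obtain the claim.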

8.8  Proof of (a): By bounding the difference of the two expected values by means of the Cauchy-Schwarz inequality (2.5) one can immediately tell that this difference tends to zero in the case of convergence in mean square:

$$|\mathrm{E}(X_n) - \mathrm{E}(X)| = |\mathrm{E}(X_n - X)| \le \sqrt{\mathrm{E}\big[(X_n - X)^2\big]} = \sqrt{\mathrm{MSE}(X_n, X)} \to 0, \qquad n \to \infty.$$

Proof of (b): The simple trick

$$X_n^2 - X^2 = (X_n - X)^2 + 2\,(X_n - X)\,X$$

yields upon expectation:

$$\begin{aligned}
\mathrm{E}(X_n^2) - \mathrm{E}(X^2)
&= \mathrm{E}\big[(X_n - X)^2\big] + 2\,\mathrm{E}\big[(X_n - X)\,X\big] \\
&\le \mathrm{E}\big[(X_n - X)^2\big] + 2\,\big|\mathrm{E}\big[(X_n - X)\,X\big]\big| \\
&\le \mathrm{E}\big[(X_n - X)^2\big] + 2\,\sqrt{\mathrm{E}\big[(X_n - X)^2\big]}\,\sqrt{\mathrm{E}(X^2)} \\
&= \mathrm{MSE}(X_n, X) + 2\,\sqrt{\mathrm{MSE}(X_n, X)}\,\sqrt{\mathrm{E}(X^2)},
\end{aligned}$$

where again the Cauchy-Schwarz inequality (2.5) was used.

Proof of (c): As is well known, convergence in mean square implies convergence in distribution, see Lemma 8.3. The latter is equivalent to the fact that the characteristic function $\phi_n(u)$ of $X_n$ tends to the characteristic function $\phi(u)$ of $X$ (see footnote 4). Now we show: If $\phi_n(u)$ belongs to a Gaussian distribution, then this holds for $\phi(u)$ as well. Hence, (a) and (b) in combination with the premise of a Gaussian sequence $\{X_n\}$ imply:

$$\phi_n(u) = \exp\left\{ i u\,\mathrm{E}(X_n) - \frac{u^2\,\mathrm{Var}(X_n)}{2} \right\} \to \exp\left\{ i u\,\mathrm{E}(X) - \frac{u^2\,\mathrm{Var}(X)}{2} \right\} = \phi(u)$$

for $n \to \infty$. Thus, the characteristic function of the limit $X$ is that of a Gaussian distribution as well.

8.9  Proof of (a): Elementarily, it can be shown that the Cauchy criterion follows from convergence in mean square:

$$\begin{aligned}
\mathrm{E}\big[(X_m - X_n)^2\big]
&= \mathrm{E}\big[\big((X_m - X) + (X - X_n)\big)^2\big] \\
&= \mathrm{E}\big[(X_m - X)^2\big] + \mathrm{E}\big[(X - X_n)^2\big] + 2\,\mathrm{E}\big[(X_m - X)(X - X_n)\big] \\
&\le \mathrm{E}\big[(X_m - X)^2\big] + \mathrm{E}\big[(X - X_n)^2\big] + 2\,\sqrt{\mathrm{E}\big[(X_m - X)^2\big]}\,\sqrt{\mathrm{E}\big[(X - X_n)^2\big]} \\
&= \mathrm{MSE}(X_m, X) + \mathrm{MSE}(X_n, X) + 2\,\sqrt{\mathrm{MSE}(X_m, X)}\,\sqrt{\mathrm{MSE}(X_n, X)},
\end{aligned}$$

where the bounding is again based on the Cauchy-Schwarz inequality (2.5). It is somewhat more involved to show that, conversely, the condition from (a) implies convergence in mean square as well. For the proof, we refer e.g. to the exposition on Hilbert spaces in Brockwell and Davis (1991, Ch. 2).


Proof of (b): If we take (a) for granted, the proof is simple. Due to

$$\mathrm{E}\big[(X_m - X_n)^2\big] = \mathrm{E}(X_m^2) + \mathrm{E}(X_n^2) - 2\,\mathrm{E}(X_m X_n),$$

one can immediately tell that the condition from (a) implies:

$$\mathrm{E}(X_m X_n) \to \lim \frac{\mathrm{E}(X_m^2) + \mathrm{E}(X_n^2)}{2} = \mathrm{E}(X^2).$$

Conversely, from the condition from (b) it naturally follows that

$$\mathrm{E}\big[(X_m - X_n)^2\big] = \mathrm{E}(X_m^2) + \mathrm{E}(X_n^2) - 2\,\mathrm{E}(X_m X_n) \to c + c - 2c = 0.$$

This completes the proof.

4 See e.g. Sections 5.7 through 5.10 in Grimmett and Stirzaker (2001) for an introduction to the theory and application of characteristic functions. In particular, it holds for the characteristic function of a random variable with a Gaussian distribution, $Y \sim \mathcal{N}(\mu, \sigma^2)$, that:

$$\phi_Y(u) = \exp\left\{ i u \mu - \frac{u^2 \sigma^2}{2} \right\}, \qquad i^2 = -1, \quad u \in \mathbb{R}.$$


Stieltjes Integrals


9.1  Summary 

Below,  we  will  encounter  Riemann-Stieltjes  integrals  (or  more  briefly:  Stieltjes 
integrals)  as  solutions  of  certain  stochastic  differential  equations.  They  can  be 
reduced  to  the  sum  of  a  Riemann  integral  and  a  multiple  of  the  Wiener  process. 
Stieltjes  integrals  are  again  Gaussian.  As  an  example  we  consider  the  Ornstein- 
Uhlenbeck  process  which  is  defined  by  a  Stieltjes  integral  and  which  will  be  dealt 
with  in  detail  in  the  chapter  on  interest  rate  models. 


9.2  Definition  and  Partial  Integration 

As  a  first  step  towards  the  Ito  integral,  we  define  Stieltjes  integrals  which  can  be 
reduced  to  Riemann  integrals  by  integration  by  parts. 


Definition 

The Riemann-Stieltjes integral (or Stieltjes integral), as it is considered here, integrates over a deterministic function $f(s)$. Nevertheless, the Stieltjes integral is random as it is integrated with respect to the stochastic Wiener process $W(s)$. In order to understand what is meant by this, we recall the partition (8.1):

$$P_n([0,t]):\quad 0 = s_0 < s_1 < \ldots < s_n = t,$$

with $s_i^* \in [s_{i-1}, s_i]$. Hence, the Riemann-Stieltjes sum is defined as

$$RS_n = \sum_{i=1}^{n} f(s_i^*)\,\big(W(s_i) - W(s_{i-1})\big). \qquad (9.1)$$

If an expression well-defined in mean square follows from this for $n \to \infty$ under (8.2), then we define it as a Stieltjes integral with the obvious notation

$$RS_n \overset{2}{\to} \int_0^t f(s)\,dW(s).$$

Note that $dW(s)$ does not stand for the derivative of the Wiener process, as this derivative does not exist. It is just a common symbolic notation.

If $f$ is continuously differentiable, then the existence of the Stieltjes integral is guaranteed, see Soong (1973, Theorem 4.5.2).

1 Thomas J. Stieltjes lived from 1856 to 1894. The Dutch mathematician generalized the concept of integration by Riemann.


Integration  by  Parts 

If $f$ is continuously differentiable, then the Stieltjes integral can be expressed by a Riemann integral and the WP. This relation is also known as integration by parts. In Chap. 11 we will understand that it is a special case of Ito's lemma, which is why we do not have to concern ourselves with a proof of Proposition 9.1 at this point.

Proposition 9.1 (Stieltjes Integral; Integration by Parts) For a continuously differentiable, deterministic function $f$ we have that

(a) the Stieltjes sum from (9.1) converges in mean square if it holds that $\max(s_i - s_{i-1}) \to 0$,

(b) and

$$\int_0^t f(s)\,dW(s) = \big[f(s)\,W(s)\big]_0^t - \int_0^t W(s)\,df(s) = f(t)\,W(t) - \int_0^t W(s)\,f'(s)\,ds,$$

where the last equality holds with probability one.


2  We  call  a  function  continuously  differentiable  if  it  has  a  continuous  first  order  derivative. 

3 Remember that we assumed $P(W(0) = 0) = 1$, which justifies the last statement. Whenever we have equalities in a stochastic setting, they are typically understood to hold with probability one for the rest of the book.

The result from (b) corresponds to the familiar rule of partial integration. As a refresher, we write this rule for two deterministic functions $f$ and $g$:

$$\int_0^t f(s)\,g'(s)\,ds = \big[f(s)\,g(s)\big]_0^t - \int_0^t g(s)\,f'(s)\,ds.$$

Hence, this is the integral form of the product rule of differentiation:

$$\frac{d\,[f(s)\,g(s)]}{ds} = f'(s)\,g(s) + g'(s)\,f(s),$$

or

$$d\,[f(s)\,g(s)] = g(s)\,df(s) + f(s)\,dg(s),$$

or

$$\big[f(s)\,g(s)\big]_0^t = \int_0^t g(s)\,df(s) + \int_0^t f(s)\,dg(s). \qquad (9.2)$$

Therefore, one can memorize Proposition 9.1(b) by means of the well-known partial integration from (9.2).

Example 9.1 (Corollary) As an application of Proposition 9.1 we consider Riemann-Stieltjes integrals for three particularly simple functions. We will encounter these relations repeatedly. The proof amounts to a simple exercise in substitution. It holds

(a) for the identity function $f(s) = s$:

$$\int_0^t s\,dW(s) = t\,W(t) - \int_0^t W(s)\,ds;$$

(b) for $f(s) = 1 - s$:

$$\int_0^t (1 - s)\,dW(s) = (1 - t)\,W(t) + \int_0^t W(s)\,ds;$$

(c) for the constant function $f(s) = 1$:

$$\int_0^t dW(s) = W(t).$$

In (c) we again observe a formal analogy of the WP with the random walk. Just like the latter is defined as the sum over the past of a pure random process, see (1.8), the WP is the integral of its past independent increments. ■
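The integration-by-parts relation of Proposition 9.1(b) has an exact discrete counterpart (summation by parts) that can be checked on a simulated path. The following is a minimal sketch under stated assumptions: an equidistant partition, evaluation at the lower endpoints, and a Wiener path approximated by Gaussian increments; the function name `stieltjes_by_parts_check` and its parameters are hypothetical.

```python
import random


def stieltjes_by_parts_check(f, t=1.0, n=2000, seed=0):
    """Compare the Riemann-Stieltjes sum (9.1), evaluated at the lower
    endpoints, with its summation-by-parts counterpart on one simulated path."""
    rng = random.Random(seed)
    h = t / n
    s = [i * h for i in range(n + 1)]
    w = [0.0]  # W(0) = 0
    for _ in range(n):
        w.append(w[-1] + rng.gauss(0.0, h ** 0.5))
    # Left-hand side: sum_i f(s_{i-1}) (W(s_i) - W(s_{i-1}))
    lhs = sum(f(s[i - 1]) * (w[i] - w[i - 1]) for i in range(1, n + 1))
    # Right-hand side: f(t) W(t) - sum_i W(s_i) (f(s_i) - f(s_{i-1}))
    rhs = f(s[n]) * w[n] - sum(
        w[i] * (f(s[i]) - f(s[i - 1])) for i in range(1, n + 1)
    )
    return lhs, rhs


# Example 9.1(b): f(s) = 1 - s
lhs, rhs = stieltjes_by_parts_check(lambda x: 1.0 - x)
print(lhs, rhs)
```

The two numbers coincide up to floating-point error for any path; for a continuously differentiable $f$ the second sum on the right approximates $\int_0^t W(s) f'(s)\,ds$ as the partition gets finer.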
9.3  Gaussian Distribution and Autocovariances

The  reduction  of  Stieltjes  integrals  to  Riemann  integrals  suggests  that  there  are 
Gaussian  processes  hiding  behind  them.  In  fact,  it  holds  that  all  Stieltjes  integrals 
follow  Gaussian  distributions  with  expectation  zero. 


Gaussian  Distribution 


The Gaussian distribution itself is obvious: The Riemann-Stieltjes sum from (9.1) is, as the sum of multivariate Gaussian random variables, Gaussian as well. Then, this also holds for the limit of the sum due to Lemma 8.1. The expected value is zero due to Propositions 9.1(b) and 8.3. The variance results as a special case of the autocovariance given in Proposition 9.3. Hence, we obtain the following proposition.

Proposition 9.2 (Normality of Stieltjes integrals) For a continuously differentiable, deterministic function $f$, it holds that

$$\int_0^t f(s)\,dW(s) \sim \mathcal{N}\left(0, \int_0^t f^2(s)\,ds\right).$$

The variance of the Stieltjes integral is well motivated as follows. For the variance of the Riemann-Stieltjes sum,

$$\mathrm{Var}\left(\sum_{i=1}^{n} f(s_i^*)\,\big(W(s_i) - W(s_{i-1})\big)\right),$$

it follows for $n \to \infty$, due to the independence of the increments of the WP, that:

$$\sum_{i=1}^{n} f^2(s_i^*)\,\mathrm{Var}\big(W(s_i) - W(s_{i-1})\big) = \sum_{i=1}^{n} f^2(s_i^*)\,(s_i - s_{i-1}) \to \int_0^t f^2(s)\,ds.$$

The convergence takes place as $f^2$ is continuous and thus Riemann-integrable. Hence, for $n \to \infty$ the expression from Proposition 9.2 is obtained.
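The convergence of the variance of the Riemann-Stieltjes sum can be illustrated without any simulation, because that variance is itself a deterministic Riemann sum. The sketch below assumes an equidistant partition and support points $s_i^* = s_{i-1} + \gamma (s_i - s_{i-1})$; the name `stieltjes_sum_variance` and the parameter `gamma` are illustrative assumptions.

```python
def stieltjes_sum_variance(f, t=1.0, n=1000, gamma=0.0):
    """Variance of the Riemann-Stieltjes sum (9.1) on an equidistant
    partition: sum_i f(s_i*)^2 (s_i - s_{i-1}), where
    s_i* = s_{i-1} + gamma * (s_i - s_{i-1})."""
    h = t / n
    return sum(f((i + gamma) * h) ** 2 * h for i in range(n))


var_identity = stieltjes_sum_variance(lambda s: s)        # close to 1/3
var_reflect = stieltjes_sum_variance(lambda s: 1.0 - s)   # close to 1/3
var_const = stieltjes_sum_variance(lambda s: 1.0, t=2.0)  # close to t = 2
print(var_identity, var_reflect, var_const)
```

For growing `n` the values approach $\int_0^t f^2(s)\,ds$, whichever support point within each interval is used.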

Let us consider the integrals from Example 9.1 and calculate the variances for $t = 1$ (see Problem 9.1).

Example 9.2 (Corollary) For the functions from Example 9.1 it holds:

(a) for the identity function $f(s) = s$:

$$\int_0^1 s\,dW(s) \sim \mathcal{N}(0, 1/3);$$

(b) for $f(s) = 1 - s$:

$$\int_0^1 (1 - s)\,dW(s) \sim \mathcal{N}(0, 1/3);$$

(c) for the constant function $f(s) = 1$:

$$W(t) = \int_0^t dW(s) \sim \mathcal{N}(0, t). \quad ■$$


Autocovariance  Function 

As a generalization of the variance, an expression for the covariance is to be found. Hence, let us define the process $Y(t) = \int_0^t f(s)\,dW(s)$. The autocovariance of $Y(t)$ and $Y(t+h)$ with $h \ge 0$ can be well justified if one takes into account that the increments $dW(t)$ of the WP are stochastically independent provided they do not overlap. Therefore, one should expect $\int_0^t f(s)\,dW(s)$ and $\int_t^{t+h} f(r)\,dW(r)$ to be uncorrelated:

$$\mathrm{E}\left[\int_0^t f(s)\,dW(s) \int_t^{t+h} f(r)\,dW(r)\right] = 0.$$

If this is true, then, due to

$$\int_0^{t+h} f(r)\,dW(r) = \int_0^t f(r)\,dW(r) + \int_t^{t+h} f(r)\,dW(r),$$

the following result is obtained:

$$\mathrm{E}\left[\int_0^t f(s)\,dW(s) \int_0^{t+h} f(r)\,dW(r)\right] = \mathrm{E}\left[\int_0^t f(s)\,dW(s) \int_0^t f(r)\,dW(r)\right] = \mathrm{Var}\left(\int_0^t f(s)\,dW(s)\right).$$

Therefore, for an arbitrary $h \ge 0$ the autocovariance coincides with the variance in $t$. In fact, this result can be verified more rigorously (see Problem 9.5).


Proposition 9.3 (Autocovariance of Stieltjes Integrals) For a continuously differentiable, deterministic function $f$ it holds that

$$\mathrm{E}\left[\int_0^t f(s)\,dW(s) \int_0^{t+h} f(s)\,dW(s)\right] = \int_0^t f^2(s)\,ds$$

with $h \ge 0$.

Of course, for $h = 0$ the variance from Proposition 9.2 is obtained.

Example 9.3 (Autocovariance of the WP) As an example, let us consider $f(s) = 1$ with

$$W(t) = \int_0^t dW(s).$$

Then, it follows for $h \ge 0$:

$$\mathrm{E}\big(W(t)\,W(t+h)\big) = \int_0^t ds = t = \min(t, t+h).$$

Trivially, this just reproduces the autocovariance structure of the Wiener process already known from (7.4). ■


9.4  Standard  Ornstein-Uhlenbeck  Process 

The  so-called  Ornstein-Uhlenbeck  process  has  been  introduced  in  a  publication  by 
the  physicists  Ornstein  and  Uhlenbeck  in  1930. 


Definition 

We define the Ornstein-Uhlenbeck process (OUP) with starting value $X_c(0) = 0$ for an arbitrary real $c$ as a Stieltjes integral,

$$X_c(t) := e^{ct}\int_0^t e^{-cs}\,dW(s), \qquad t \ge 0, \quad X_c(0) = 0. \qquad (9.3)$$

For $c = 0$ in (9.3) the Wiener process, $X_0(t) = W(t)$, is obtained. More precisely, $X_c(t)$ from (9.3) is a standard OUP; a generalization will be offered in the chapter on interest rate dynamics. By definition, it holds that:

$$\begin{aligned}
X_c(t+1) &= e^{ct}\,e^{c}\left[\int_0^t e^{-cs}\,dW(s) + \int_t^{t+1} e^{-cs}\,dW(s)\right] \\
&= e^{c}\,X_c(t) + e^{c(t+1)}\int_t^{t+1} e^{-cs}\,dW(s) \\
&= e^{c}\,X_c(t) + \varepsilon(t+1),
\end{aligned}$$

where $\varepsilon(t+1)$ was defined implicitly. Note that the increments $dW(s)$ from $t$ on in $\varepsilon(t+1)$ are independent of the increments up to $t$ as they appear in $X_c(t)$. Hence, the OUP is a continuous counterpart of the AR(1) process from Chap. 3 where the autoregressive parameter is denoted by $e^c$. For $c < 0$ this parameter is less than one, such that in this case we expect a stable adjustment or, in a way, a quasi-stationary behavior. This will be reflected by the behavior of the variance and the covariance function which are given, among others, in the following proposition.


Properties 

The  proof  of  Proposition  9.4  will  be  given  in  an  exercise  problem.  It  comprises  an 
application  of  Propositions  9.1,  9.2  and  9.3. 

Proposition 9.4 (Ornstein-Uhlenbeck Process) It holds for the Ornstein-Uhlenbeck process from (9.3) that:

(a) $X_c(t) = W(t) + c\,e^{ct}\int_0^t e^{-cs}\,W(s)\,ds$,

(b) $X_c(t) \sim \mathcal{N}\big(0, (e^{2ct} - 1)/(2c)\big)$,

(c) $\mathrm{E}\big(X_c(t)\,X_c(t+h)\big) = e^{ch}\,\mathrm{Var}(X_c(t))$,

where $h \ge 0$.

Statement (a) establishes the usual relation between Stieltjes and Riemann integrals and, seen individually, it is not that thrilling. As for $c = 0$ the OUP coincides with the WP, it is interesting to examine the variance from (b) for $c \to 0$. L'Hospital's rule yields:

$$\lim_{c \to 0}\frac{e^{2ct} - 1}{2c} = \lim_{c \to 0}\frac{2t\,e^{2ct}}{2} = t.$$

Hence, for $c \to 0$ the variance of the WP is embedded in (b). The covariance from (c) allows for determining the autocorrelation:

$$\mathrm{corr}\big(X_c(t), X_c(t+h)\big) = \frac{e^{ch}\,\mathrm{Var}(X_c(t))}{\sqrt{\mathrm{Var}(X_c(t))}\,\sqrt{\mathrm{Var}(X_c(t+h))}} = e^{ch}\,\frac{\sqrt{\mathrm{Var}(X_c(t))}}{\sqrt{\mathrm{Var}(X_c(t+h))}}.$$

Now, let us assume that $c < 0$. Then it holds for $t$ growing that:

$$\lim_{t \to \infty}\mathrm{Var}(X_c(t)) = -\frac{1}{2c} > 0.$$

Accordingly, it holds for the autocorrelation that:

$$\lim_{t \to \infty}\mathrm{corr}\big(X_c(t), X_c(t+h)\big) = e^{ch}, \qquad c < 0.$$

Thus, for $c < 0$ we obtain the "asymptotically stationary" case with asymptotically constant variance and an autocorrelation being asymptotically dependent on the lag $h$ only. Thereby, the autocorrelation results as the $h$-th power of the "autoregressive parameter" $a = e^c$. With $h$ growing, the autocovariance decays gradually. This finds its counterpart in the discrete-time AR(1) process. Just as the random walk arises from the AR(1) process with the parameter value one, the WP with $c = 0$, i.e. $a = e^0 = 1$, is the corresponding special case of the OUP. Hence, we can definitely consider the OUP as a continuous-time analog to the AR(1) process.
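The variance and autocorrelation formulas derived from Proposition 9.4 are easy to evaluate numerically. The following sketch merely codes these closed-form expressions; the function names `oup_variance` and `oup_autocorrelation` are illustrative, and the case $c = 0$ is handled through the limit $t$ obtained above.

```python
import math


def oup_variance(c, t):
    """Var(X_c(t)) = (e^{2ct} - 1) / (2c); for c = 0 the limit t is used."""
    if c == 0:
        return t
    return (math.exp(2 * c * t) - 1) / (2 * c)


def oup_autocorrelation(c, t, h):
    """corr(X_c(t), X_c(t+h)) = e^{ch} sqrt(Var(X_c(t)) / Var(X_c(t+h)))."""
    return math.exp(c * h) * math.sqrt(oup_variance(c, t) / oup_variance(c, t + h))


c, h = -0.5, 2.0
for t in (1.0, 5.0, 50.0):
    print(t, oup_variance(c, t), oup_autocorrelation(c, t, h))
# For large t the variance approaches -1/(2c) and the autocorrelation e^{ch}.
```

For $c = 0$ the same function returns $\sqrt{t/(t+h)}$, the autocorrelation of the Wiener process, which does not settle to a function of the lag alone.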


Simulation 

The theoretical properties of the process for $c < 0$ can be illustrated graphically. In Fig. 9.1 the simulated paths of two parameter constellations are shown. It can be observed that the process oscillates about the zero line where the variance or the deviation from zero for $c = -0.1$ is much larger (see footnote 4) than in the case $c = -0.9$. This is clear against the background of (b) from Proposition 9.4 in which the first moment and the variance are given: The expected value is zero and the variance decreases with the absolute value of $c$ increasing. The positive autocorrelation (cf. Proposition 9.4(c)) is obvious as well: Positive values tend to be followed by positive values and the inverse holds for negative observations. The closer to zero $c$ is, the stronger the autocorrelation gets. That is why the graph for $c = -0.1$ is strongly determined by the "local" trend and does not cross the zero line for longer time spans while for $c = -0.9$ the force which pulls the observations back to the zero line is more effective such that the graph looks "more stationary" for $c = -0.9$.

Fig. 9.1  Standard Ornstein-Uhlenbeck processes (time axis from 0 to 20; panel label: $c = -0.1$)

4 If the arithmetic mean of the 1000 observations of this time series is calculated, then with $-0.72344$ a notably negative number is obtained although the theoretical expected value is zero. Details on the simulation of OUP paths are to follow in Sect. 13.2.
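Paths like those in Fig. 9.1 can be generated from the AR(1) representation derived in the definition above, applied with a step length $d$ instead of one: $X_c(t+d) = e^{cd} X_c(t) + \varepsilon$, where by Proposition 9.2 the innovation has variance $(e^{2cd} - 1)/(2c)$. The following is only a sketch under these assumptions and not necessarily the method of Sect. 13.2; the function names, the default of 1000 steps on $[0, 20]$ and the seed are illustrative choices.

```python
import math
import random


def simulate_oup(c, t_max=20.0, n=1000, seed=0):
    """Simulate X_c on an equidistant grid via X(t+d) = e^{cd} X(t) + eps,
    eps ~ N(0, (e^{2cd} - 1) / (2c)), starting from X_c(0) = 0."""
    rng = random.Random(seed)
    d = t_max / n
    a = math.exp(c * d)  # "autoregressive parameter" for step length d
    innov_var = d if c == 0 else (math.exp(2 * c * d) - 1) / (2 * c)
    innov_sd = math.sqrt(innov_var)
    path = [0.0]
    for _ in range(n):
        path.append(a * path[-1] + rng.gauss(0.0, innov_sd))
    return path


def implied_variance(c, t_max=20.0, n=1000):
    """Variance at t_max implied by iterating v <- a^2 v + Var(eps)."""
    d = t_max / n
    a2 = math.exp(2 * c * d)
    innov_var = d if c == 0 else (a2 - 1) / (2 * c)
    v = 0.0
    for _ in range(n):
        v = a2 * v + innov_var
    return v


path_slow = simulate_oup(-0.1)  # weak pull towards zero
path_fast = simulate_oup(-0.9)  # strong pull towards zero
print(implied_variance(-0.1), implied_variance(-0.9))
```

The variance recursion reproduces $(e^{2ct} - 1)/(2c)$ from Proposition 9.4(b) at $t = 20$, so the discretization is exact in distribution on the grid.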


9.5  Problems  and  Solutions 


Problems 

9.1  Calculate  the  variances  from  Example  9.2. 

9.2  Verify  the  Gaussian  distribution  from  Corollary  8.1(b). 

9.3  Verify the following equality (with probability 1):

$$\int_0^t s^2\,dW(s) = t^2\,W(t) - 2\int_0^t s\,W(s)\,ds.$$

9.4  Determine the variance of the process $X(t)$ with

$$X(t) = \int_0^t s^2\,dW(s).$$

9.5  Prove Proposition 9.3.

9.6  Prove  (a)  from  Proposition  9.4. 

9.7  Show  (b)  from  Proposition  9.4. 

9.8  Prove  (c)  from  Proposition  9.4. 


Solutions 


9.1  From Proposition 9.2 it obviously follows for (a) that:

$$\int_0^1 s^2\,ds = \left[\frac{s^3}{3}\right]_0^1 = \frac{1}{3}.$$

Equally, one shows (b):

$$\int_0^1 (1 - s)^2\,ds = \left[-\frac{(1 - s)^3}{3}\right]_0^1 = \frac{1}{3}.$$

Finally, the result from (c) is known anyway.

9.2  The result follows from the examples of this chapter. From Example 9.1(a) we obtain for $t = 1$:

$$W(1) - \int_0^1 W(s)\,ds = \int_0^1 s\,dW(s).$$

Due to Example 9.2(a) the claim is verified.

9.3  This is a straightforward application of Proposition 9.1. With $f(s) = s^2$ and $f'(s) = 2s$ the claim is established.

9.4  From Proposition 9.2 with $f(s) = s^2$ it follows for the variance

$$\mathrm{Var}\left(\int_0^t s^2\,dW(s)\right) = \int_0^t s^4\,ds = \frac{1}{5}\,t^5.$$


9.5  With $Y(t) = \int_0^t f(s)\,dW(s)$ we know from Proposition 9.1 that:

$$Y(t) = f(t)\,W(t) - \int_0^t f'(s)\,W(s)\,ds.$$

Hence, the covariance results as

$$\mathrm{E}\big(Y(t)\,Y(t+h)\big) = A - B - C + D,$$

where the expressions on the right-hand side are defined by multiplying $Y(t)$ and $Y(t+h)$. Now, we consider them one by one.

For $A$ we obtain immediately:

$$\begin{aligned}
A &= \mathrm{E}\big[f(t)\,f(t+h)\,W(t)\,W(t+h)\big] \\
&= f(t)\,f(t+h)\,\min(t, t+h) \\
&= f(t)\,f(t+h)\,t.
\end{aligned}$$

By Fubini's theorem it holds for $B$ that:

$$\begin{aligned}
B &= \mathrm{E}\left[f(t+h)\int_0^t f'(s)\,W(s)\,W(t+h)\,ds\right] \\
&= f(t+h)\int_0^t f'(s)\,\min(s, t+h)\,ds \\
&= f(t+h)\int_0^t f'(s)\,s\,ds.
\end{aligned}$$

Integration by parts in the following form,

$$\int_0^t f(r)\,r\,dr = F(t)\,t - \int_0^t F(r)\,dr \quad\text{with } F' = f, \qquad (9.4)$$

applied to $f'$ yields:

$$B = f(t+h)\,\big[f(t)\,t - F(t) + F(0)\big],$$

where $F(s)$ denotes the antiderivative of $f(s)$. In the same way, we obtain

$$\begin{aligned}
C &= \mathrm{E}\left[f(t)\int_0^{t+h} f'(s)\,W(s)\,W(t)\,ds\right] \\
&= f(t)\int_0^{t+h} f'(s)\,\min(s, t)\,ds \\
&= f(t)\left[\int_0^t f'(s)\,s\,ds + \int_t^{t+h} f'(s)\,t\,ds\right] \\
&= f(t)\,\big[f(t)\,t - F(t) + F(0) + t\,\big(f(t+h) - f(t)\big)\big] \\
&= f(t)\,\big[F(0) - F(t) + t\,f(t+h)\big].
\end{aligned}$$

For the fourth expression Proposition 8.4 provides us with $f'$ instead of $f$:

$$\begin{aligned}
D &= \mathrm{E}\left[\int_0^t f'(s)\,W(s)\,ds \int_0^{t+h} f'(r)\,W(r)\,dr\right] \\
&= \int_0^t f'(s)\,\big[f(s)\,s - (F(s) - F(0)) + s\,\big(f(t+h) - f(s)\big)\big]\,ds \\
&= \int_0^t f'(s)\,ds\;F(0) - \int_0^t f'(s)\,F(s)\,ds + f(t+h)\int_0^t s\,f'(s)\,ds.
\end{aligned}$$

In addition to (9.4), we apply integration by parts in the form of

$$\int_0^t f'(s)\,F(s)\,ds = f(t)\,F(t) - f(0)\,F(0) - \int_0^t f^2(s)\,ds.$$

Then it holds that:

$$\begin{aligned}
D &= \big(f(t) - f(0)\big)\,F(0) - f(t)\,F(t) + f(0)\,F(0) + \int_0^t f^2(s)\,ds + f(t+h)\,\big(f(t)\,t - F(t) + F(0)\big) \\
&= \int_0^t f^2(s)\,ds + \big(F(0) - F(t)\big)\big(f(t) + f(t+h)\big) + f(t)\,f(t+h)\,t.
\end{aligned}$$

If we assemble the terms, then we obtain the autocovariance function in the desired form:

$$\mathrm{E}\left[\int_0^t f(s)\,dW(s) \int_0^{t+h} f(r)\,dW(r)\right] = A - B - C + D = \int_0^t f^2(s)\,ds.$$


9.6  We use Proposition 9.1 with $f(s) = e^{-cs}$:

$$\int_0^t e^{-cs}\,dW(s) = e^{-ct}\,W(t) + c\int_0^t e^{-cs}\,W(s)\,ds.$$

Multiplying by $e^{ct}$ yields

$$e^{ct}\int_0^t e^{-cs}\,dW(s) = W(t) + c\,e^{ct}\int_0^t e^{-cs}\,W(s)\,ds.$$

On the left-hand side, we have the OUP $X_c(t)$ by definition, which was to be verified.

9.7  Due to Proposition 9.2 the OUP is Gaussian with expectation zero and variance

$$\begin{aligned}
\mathrm{Var}(X_c(t)) &= e^{2ct}\int_0^t e^{-2cs}\,ds \\
&= e^{2ct}\left[\frac{e^{-2cs}}{-2c}\right]_0^t \\
&= e^{2ct}\,\frac{e^{-2ct} - 1}{-2c} \\
&= \frac{1 - e^{2ct}}{-2c}.
\end{aligned}$$

This is equal to the claimed variance.

9.8  As for the derivation of the variance, we use

$$\int_0^t e^{-2cs}\,ds = \frac{1 - e^{-2ct}}{2c}.$$

From Proposition 9.3 we know that this is also the expression for the autocovariance of the Stieltjes integrals ($h \ge 0$):

$$\mathrm{E}\left[\int_0^t e^{-cs}\,dW(s) \int_0^{t+h} e^{-cs}\,dW(s)\right] = \frac{1 - e^{-2ct}}{2c}.$$

Hence, we obtain for the OUP:

$$\begin{aligned}
\mathrm{E}\big(X_c(t)\,X_c(t+h)\big) &= e^{ct}\,e^{c(t+h)}\,\frac{1 - e^{-2ct}}{2c} \\
&= e^{ch}\,\frac{e^{2ct} - 1}{2c} \\
&= e^{ch}\,\mathrm{Var}(X_c(t)).
\end{aligned}$$


Reference 

Soong,  T.  T.  (1973).  Random  differential  equations  in  science  and  engineering.  New  York: 
Academic  Press. 

10.1  Summary

Kiyoshi  Ito  (1915-2008)  was  awarded  the  inaugural  Gauss  Prize  by  the  International 
Mathematical  Union  in  2006.  Stochastic  integration  in  the  narrow  sense  can  be 
traced  back  to  his  early  work  published  in  Japanese  in  the  forties  of  the  last 
century.  We  precede  the  general  definition  of  the  Ito  integral  with  a  special  case. 
Concluding,  we  discuss  the  (quadratic)  variation  of  a  process  without  which  a  sound 
understanding  of  Ito’s  lemma  will  not  be  possible. 


10.2  A  Special  Case 

We  start  with  a  special  case  of  Ito  integration,  so  to  speak  the  mother  of  all  stochastic 
integrals.  Thereby  we  will  understand  that,  besides  the  Ito  integral,  infinitely  many 
related  integrals  of  a  similar  structure  exist. 

Problems  with  the  Definition 

The starting point is again a partition

$$P_n([0,t]):\quad 0 = s_0 < s_1 < \ldots < s_n = t,$$

that gets finer for $n$ growing since we continue to maintain (8.2). Given this decomposition of $[0,t]$, we define analogously to the Riemann-Stieltjes sum for $s_i^* \in [s_{i-1}, s_i)$:

$$S_n(W) = \sum_{i=1}^{n} W(s_i^*)\,\big(W(s_i) - W(s_{i-1})\big). \qquad (10.1)$$

For $n \to \infty$ we would like to denote the limit as $\int_0^t W(s)\,dW(s)$, which looks like a Stieltjes integral of a WP with respect to a WP. However, we will realize that:

1. The limit of $S_n(W)$ is not unique, but depends on the choice of $s_i^*$;

2. the limit of $S_n(W)$ is not defined as a Stieltjes integral.

As the Stieltjes integral has a unique limit independently of $s_i^*$, the second claim follows from the first one. If one chooses in particular the lower endpoint of the interval as support, $s_i^* = s_{i-1}$, then this leads to the Ito integral. Hence, this special case is called the Ito sum:

$$I_n(W) = \sum_{i=1}^{n} W(s_{i-1})\,\big(W(s_i) - W(s_{i-1})\big). \qquad (10.2)$$

The following proposition specifies the dependence on $s_i^*$. The special case $\gamma = 0$ leading to the Ito integral will be proved in Problem 10.1; the general result is established e.g. in Tanaka (1996, eq. (2.40)). The convergence is again in mean square.

Proposition 10.1 (Stochastic Integrals in Mean Square) Let $s_i^* = (1 - \gamma)\,s_{i-1} + \gamma\,s_i$ with $0 \le \gamma < 1$. Then it holds for the sum from (10.1) with $n \to \infty$ under (8.2):

$$S_n(W) \overset{2}{\to} \frac{1}{2}\big(W^2(t) - t\big) + \gamma\,t.$$

Before we discuss two special cases of Proposition 10.1, this striking result is to be somewhat better understood. We call it striking because it is counter-intuitive at first glance that the choice of $s_i^*$ should matter with the intervals $[s_{i-1}, s_i)$ getting narrower and narrower for $n \to \infty$. To better understand this, we temporarily denote the limit of $S_n(W)$ as $S(\gamma)$:

$$S_n(W) \overset{2}{\to} S(\gamma).$$

Then one observes immediately:

$$S(\gamma) = S(0) + \gamma\,t.$$

This means that the variance of all these stochastic integrals $S(\gamma)$ is identical, i.e. equal to the variance of $S(0)$. Hence, the choice of different support points $s_i^*$ is only reflected in the expected value:

$$\mathrm{E}(S(\gamma)) = \frac{1}{2}\big(\mathrm{E}(W^2(t)) - t\big) + \gamma\,t = \gamma\,t.$$

However, this expected value can be well understood as for finite sums it can be shown that (see Problem 10.4):

$$\mathrm{E}(S_n(W)) = \gamma\,t.$$

This simply follows from the fact that $W(s_i^*)$ is not independent of $W(s_i) - W(s_{i-1})$ for $\gamma > 0$. Next, we turn towards the case $\gamma = 0$.
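The dependence of the limit on the support points can be made visible on a single simulated path. The sketch below assumes an equidistant partition with $n$ intervals and simulates the Wiener path on a grid with $2n$ steps, so that both the lower endpoints ($\gamma = 0$) and the midpoints ($\gamma = 0.5$) are grid points; the names `integral_sums` and `limit` as well as the seed and the grid size are illustrative assumptions.

```python
import random


def integral_sums(n=20000, t=1.0, seed=1):
    """Return W(t) and the sums S_n(W) from (10.1) for gamma = 0 and 0.5,
    computed from one Wiener path simulated on a grid with 2n steps."""
    rng = random.Random(seed)
    sd = (t / (2 * n)) ** 0.5
    w = [0.0]  # W(0) = 0
    for _ in range(2 * n):
        w.append(w[-1] + rng.gauss(0.0, sd))
    sums = {0.0: 0.0, 0.5: 0.0}
    for i in range(1, n + 1):
        lower, mid, upper = w[2 * i - 2], w[2 * i - 1], w[2 * i]
        increment = upper - lower            # W(s_i) - W(s_{i-1})
        sums[0.0] += lower * increment       # support point s_{i-1}
        sums[0.5] += mid * increment         # support point (s_{i-1} + s_i) / 2
    return w[-1], sums


def limit(w_t, t, gamma):
    """Mean-square limit (W^2(t) - t) / 2 + gamma * t from Proposition 10.1."""
    return 0.5 * (w_t ** 2 - t) + gamma * t


w_t, sums = integral_sums()
for gamma in (0.0, 0.5):
    print(gamma, sums[gamma], limit(w_t, 1.0, gamma))
```

For a fine partition both sums lie close to their respective limits, and their difference is close to $\gamma t = 0.5$ although both are built from the same path.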


Ito  Integral 

For $\gamma = 0$, $S_n(W)$ from Proposition 10.1 merges into $I_n(W)$ from (10.2). The proposition guarantees two different things: First, that $I_n(W)$ converges in mean square. We call this limit the Ito integral and write instead of $S(0)$ the following integral:

$$I_n(W) \overset{2}{\to} \int_0^t W(s)\,dW(s).$$

Secondly, the proposition yields an expression for this Ito integral:

$$\int_0^t W(s)\,dW(s) = \frac{1}{2}\,W^2(t) - \frac{1}{2}\,t. \qquad (10.3)$$

By the way, (10.3) is just the "stochastified chain rule" for Wiener processes from (1.14). Note the analogy and the contrast to the deterministic case (with $f(0) = 0$):

$$\int_0^t f(s)\,df(s) = \frac{1}{2}\,f^2(t) \quad\text{for } f(0) = 0. \qquad (10.4)$$

In Eq. (10.3) we find, so to speak, the archetype of Ito calculus, i.e. of stochastic calculus using Ito's lemma. The latter will be covered in the next chapter.

1 In particular for $t = 1$, (10.3) accomplishes the transition from (1.11) to (1.12) for the Dickey-Fuller distribution.

The moments of the Ito integral can be determined by (10.3). According to this equation it holds for the expected value:

$$\mathrm{E}\left(\int_0^t W(s)\,dW(s)\right) = \frac{1}{2}\big(\mathrm{E}(W^2(t)) - t\big) = 0.$$

The variance of the integral can be calculated elementarily as well (see footnote 2):

$$\begin{aligned}
\mathrm{Var}\left(\int_0^t W(s)\,dW(s)\right) &= \frac{1}{4}\,\mathrm{E}\big(W^4(t) - 2\,t\,W^2(t) + t^2\big) \\
&= \frac{1}{4}\,\big(3t^2 - 2t^2 + t^2\big) \\
&= \frac{t^2}{2},
\end{aligned}$$

where the kurtosis of 3 for Gaussian random variables was used. Hence, we have the first two results of the following proposition.


Proposition 10.2 (Moments of $\int_0^t W(s)\,dW(s)$) For $I(t) = \int_0^t W(s)\,dW(s)$ it holds that

$$\mathrm{E}(I(t)) = 0 \quad\text{and}\quad \mathrm{Var}(I(t)) = \frac{t^2}{2},$$

and

$$\mathrm{E}\big(I(t)\,I(t+h)\big) = \frac{t^2}{2} \quad\text{for } h \ge 0.$$

As a third result it holds in Proposition 10.2, just as for the Stieltjes integral, that the autocovariance coincides with the variance, i.e.

$$\mathrm{E}\big(I(t)\,I(t+h)\big) = \mathrm{Var}(I(t)), \qquad h \ge 0.$$

We will prove this in Problem 10.2.

2 An alternative, interesting method uses the fact that the variance of a chi-squared distributed random variable equals twice its degrees of freedom:

$$\mathrm{Var}\left(\int_0^t W(s)\,dW(s)\right) = \frac{1}{4}\,\mathrm{Var}\big(W^2(t)\big) = \frac{t^2}{4}\,\mathrm{Var}\left(\frac{W^2(t)}{t}\right) = \frac{t^2}{4}\cdot 2 = \frac{t^2}{2},$$

as it holds that $W(t)/\sqrt{t} \sim \mathcal{N}(0, 1)$ and therefore $W^2(t)/t \sim \chi^2(1)$.


Stratonovich  Integral 


For a reason that will soon become evident, the Stratonovich integral is sometimes
considered a competitor of the Ito integral. It is defined as the limit of $S_n(W)$ from
(10.1) with the midpoints of the intervals as $s_i^*$:

$$s_i^* = \frac{s_{i-1} + s_i}{2}.$$


This corresponds to the choice of $\gamma = 0.5$ in Proposition 10.1. Let the limit in mean
square be denoted as follows:

$$\sum_{i=1}^n W\!\left(\frac{s_{i-1} + s_i}{2}\right)\left(W(s_i) - W(s_{i-1})\right) \overset{2}{\to} \int_0^t W(s)\,\partial W(s),$$

where “$\partial$” does not stand for the partial derivative but denotes the Stratonovich
integral in contrast to the Ito integral. By the way, with $\gamma = 0.5$ Proposition 10.1
yields:

$$\int_0^t W(s)\,\partial W(s) = \frac{W^2(t)}{2}.$$

Hence,  the  Stratonovich  integral  stands  out  due  to  the  fact  that  the  familiar 
integration  rule  known  from  ordinary  calculus  holds  true.  In  differential  notation 
this  rule  can  be  formulated  symbolically  as  follows: 

$$\frac{dW^2(t)}{2} = W(t)\,\partial W(t).$$

This  just  corresponds  to  the  ordinary  chain  rule,  cf.  (10.4).  Although  the  Ito  and  the 
Stratonovich  integral  are  distinguished  from  each  other  only  by  the  choice  of  s*  with 
intervals  getting  shorter  and  shorter,  they  still  have  drastically  different  properties. 
Obviously, it holds for the expected value

$$\mathrm{E}\left[\int_0^t W(s)\,\partial W(s)\right] = \frac{t}{2},$$

while the Ito integral is zero on average. However, as mentioned above, the variances
of $\int_0^t W(s)\,dW(s)$ and $\int_0^t W(s)\,\partial W(s)$ coincide, cf. Problem 10.3 as well.
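The difference between the two integrals can be made visible on a single simulated path. The sketch below is an illustration under assumptions of our own (equidistant partition, a fine grid with twice as many points so that the interval midpoints are available, hypothetical function name): it evaluates the sum once at the lower endpoints and once at the midpoints.

```python
import math
import random


def ito_and_stratonovich_sums(t, n, seed=0):
    """Return (Ito sum, midpoint sum, W(t)) for one simulated Wiener path."""
    rng = random.Random(seed)
    sd = math.sqrt(t / (2 * n))          # fine grid: 2n steps of length t/(2n)
    w = [0.0]
    for _ in range(2 * n):
        w.append(w[-1] + rng.gauss(0.0, sd))
    ito = strat = 0.0
    for i in range(1, n + 1):
        left, mid, right = w[2 * i - 2], w[2 * i - 1], w[2 * i]
        incr = right - left              # W(s_i) - W(s_{i-1})
        ito += left * incr               # lower endpoint (gamma = 0)
        strat += mid * incr              # midpoint (gamma = 0.5)
    return ito, strat, w[-1]


t = 1.0
ito, strat, w_t = ito_and_stratonovich_sums(t, n=20_000, seed=3)
print(f"Ito sum      {ito:.4f}   vs (W^2(t) - t)/2 = {(w_t ** 2 - t) / 2:.4f}")
print(f"midpoint sum {strat:.4f}   vs  W^2(t)/2      = {w_t ** 2 / 2:.4f}")
```

On a fine partition the two sums should differ by roughly $t/2$, which is the gap between (10.3) and the Stratonovich result.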


Example 10.1 (Alternative Stratonovich Sum) Sometimes the Stratonovich integral
is defined as the limit of the following sum, see e.g. Klebaner (2005, eq. (5.65)):

$$\sum_{i=1}^n \frac{W(s_{i-1}) + W(s_i)}{2}\,\left(W(s_i) - W(s_{i-1})\right).$$

The intuition behind this is, due to the continuity of the WP, that

$$\frac{W(s_{i-1}) + W(s_i)}{2} \approx W\!\left(\frac{s_{i-1} + s_i}{2}\right).$$

In fact, it can be shown more explicitly that the following difference becomes
negligible in mean square:

$$\Gamma = W\!\left(\frac{s_{i-1} + s_i}{2}\right) - \frac{W(s_{i-1}) + W(s_i)}{2}.$$

For this purpose we consider the mean square deviation with $s_i^* = (s_{i-1} + s_i)/2$:

$$\mathrm{MSE}(\Gamma, 0) = \mathrm{E}\left[(\Gamma - 0)^2\right]
= \mathrm{E}\left[W^2(s_i^*) - W(s_i^*)\left(W(s_{i-1}) + W(s_i)\right)\right]
+ \mathrm{E}\left[\frac{W^2(s_{i-1}) + 2\,W(s_{i-1})\,W(s_i) + W^2(s_i)}{4}\right].$$

Due to $s_{i-1} < s_i^* < s_i$ the familiar variance and covariance formulas yield:

$$\mathrm{MSE}(\Gamma, 0) = s_i^* - s_{i-1} - s_i^* + \frac{s_{i-1} + 2\,s_{i-1} + s_i}{4} = \frac{s_i - s_{i-1}}{4}.$$

As for $n \to \infty$ the partition gets finer and finer, $s_i - s_{i-1} \to 0$, the replacement of
$W(s_i^*)$ by $(W(s_{i-1}) + W(s_i))/2$ is asymptotically well justified. ■


10.3 General Ito Integrals

After  covering  general  Ito  integrals,  we  define  so-called  diffusions  that  we  will  be 
concerned  with  in  the  following  chapters. 
Definition and Moments

In  order  to  define  general  Ito  integrals,  we  consider  for  a  stochastic  process  X  as  a 
generalization  of  the  sum  In(W): 

$$I_n(X) = \sum_{i=1}^n X(s_{i-1})\left(W(s_i) - W(s_{i-1})\right). \tag{10.5}$$

For the Ito integral two special things apply: First, the lower endpoint of the interval
is chosen as $s_i^* = s_{i-1}$, i.e. $X(s_{i-1})$; and secondly, we integrate with respect to
the WP, $(W(s_i) - W(s_{i-1}))$. If $X$ were integrated with respect to another stochastic
process, then one would obtain even more general stochastic integrals, which we
are not interested in here.

If $X(t)$ is a process with finite variance where the variance varies continuously in
the course of time, and if $X(t)$ only depends on the past of the WP, $W(s)$ with $s \leq t$,
but not on its future, then the Ito sum converges uniquely and independently of the
partition. The limit is called Ito integral and is denoted as follows:

$$\int_0^t X(s)\,dW(s).$$

The assumptions about $X(t)$ are stronger than necessary; however, they guarantee
the existence of the moments of an Ito integral, too. Similar assumptions can be
found in Klebaner (2005, Theorem 4.3) or Øksendal (2003, Corollary 3.1.7).

Proposition 10.3 (General Ito Integral) Let $X(s)$ be a stochastic process on $[0, t]$
with two properties:

(i) $\mu_2(s) = \mathrm{E}\left(X^2(s)\right) < \infty$ is a continuous function,

(ii) $X(s)$ is independent of $W(s_j) - W(s_i)$ with $s \leq s_i < s_j$.

Then it holds that

(a) the sum from (10.5) converges in mean square:

$$\sum_{i=1}^n X(s_{i-1})\left(W(s_i) - W(s_{i-1})\right) \overset{2}{\to} \int_0^t X(s)\,dW(s);$$

(b) the moments of the Ito integral are determined as:

$$\mathrm{E}\left(\int_0^t X(s)\,dW(s)\right) = 0, \qquad \mathrm{Var}\left(\int_0^t X(s)\,dW(s)\right) = \int_0^t \mathrm{E}\left(X^2(s)\right) ds.$$
Naturally, for $X(s) = W(s)$ the extensively discussed example from the previous
section is obtained. In particular, in (b) the moments from Proposition 10.2 are
reproduced, and it holds for the variance that:

$$\int_0^t \mathrm{E}\left(W^2(s)\right) ds = \int_0^t s\,ds = \frac{t^2}{2}.$$

Example 10.2 (Stieltjes Integral) Consider the special case where $X(s)$ is not
stochastic but deterministic,

$$X(s) = f(s),$$

where $f(s)$ is continuous. Then the conditions of existence are fulfilled, which
can easily be verified: The square, $\mu_2(s) = f^2(s)$, is continuous as well, and the
deterministic function is independent of $W(s)$. Hence, it holds that

$$\sum_{i=1}^n f(s_{i-1})\left(W(s_i) - W(s_{i-1})\right) \overset{2}{\to} \int_0^t f(s)\,dW(s).$$

In other words: For deterministic processes, $X(s) = f(s)$, the Stieltjes and the Ito
integral coincide; the former is a special case of the latter. Due to $\mathrm{E}(f^2(s)) = f^2(s)$
the already familiar formulas for expectation and variance from Proposition 9.2 are
embedded in the general Proposition 10.3. ■
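As an illustration of part (b) of Proposition 10.3 in the deterministic case, the following sketch simulates the sum from Example 10.2 for the choice $f(s) = s$ on $[0, 1]$, for which $\int_0^1 s^2\,ds = 1/3$. The integrand, the grid size and the function name `stieltjes_sum` are assumptions made only for this example.

```python
import math
import random


def stieltjes_sum(f, t, n_steps, rng):
    """Left-endpoint sum of f(s_{i-1}) * (W(s_i) - W(s_{i-1})) on an equidistant grid."""
    dt = t / n_steps
    sd = math.sqrt(dt)
    # i * dt for i = 0, ..., n_steps - 1 are the lower endpoints s_{i-1}
    return sum(f(i * dt) * rng.gauss(0.0, sd) for i in range(n_steps))


rng = random.Random(2024)
draws = [stieltjes_sum(lambda s: s, 1.0, 100, rng) for _ in range(4000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)
print(f"sample mean {mean:.3f}, sample variance {var:.3f}, theoretical variance {1 / 3:.3f}")
```

The sample variance should be close to $1/3$, up to simulation noise and the discretization of the integral.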


Distribution  and  Further  Properties 


As is well known, the special case of the Stieltjes integral is Gaussian. For the Ito
integral this does not hold in general. This can clearly be seen in (10.3):

$$\int_0^t W(s)\,dW(s) = \frac{W^2(t) - t}{2} \geq -\frac{t}{2},$$

i.e. the support of the distribution is bounded from below. Then again, the integral of a WP
with respect to a WP that is stochastically independent of it leads to a Gaussian
distribution after suitable normalization. The following result is by Phillips and Park (1988).


Proposition 10.4 (Ito Integral of an Independent WP) Let $W(t)$ and $V(s)$ be
stochastically independent Wiener processes. Then it holds that

$$\left(\int_0^t V^2(s)\,ds\right)^{-0.5} \int_0^t V(s)\,dW(s) \sim \mathcal{N}(0, 1).$$
By showing the conditional distribution of the left-hand side given $V(t)$ to just
follow a $\mathcal{N}(0, 1)$ distribution and therefore not to depend on this condition,
one proves the result claimed in Proposition 10.4; see Phillips and Park (1988,
Appendix) for details.
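Proposition 10.4 can be made plausible by a small simulation. The sketch below is only an illustration under our own assumptions: both integrals are replaced by sums over an equidistant grid, and the name `normalized_ito_integral` is hypothetical. The sample mean and variance of the normalized statistic should then be near 0 and 1.

```python
import math
import random


def normalized_ito_integral(t, n_steps, rng):
    """One draw of (sum V^2 ds)^(-1/2) * sum V dW for independent V and W."""
    dt = t / n_steps
    sd = math.sqrt(dt)
    v = 0.0
    num = 0.0   # left-endpoint sum of V(s_{i-1}) * (W(s_i) - W(s_{i-1}))
    den = 0.0   # Riemann sum of V^2(s_{i-1}) * (s_i - s_{i-1})
    for _ in range(n_steps):
        dw = rng.gauss(0.0, sd)     # increment of W
        num += v * dw
        den += v * v * dt
        v += rng.gauss(0.0, sd)     # increment of V, drawn independently of W
    return num / math.sqrt(den)


rng = random.Random(42)
draws = [normalized_ito_integral(1.0, 100, rng) for _ in range(2000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)
print(f"sample mean {mean:.3f}, sample variance {var:.3f}")
```

The argument sketched in the text is visible in the code: conditional on the path of V, the numerator is a sum of independent Gaussian increments whose variance equals the denominator squared.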

Note  that  the  Ito  integral  again  defines  a  stochastic  process  in  the  bounds  from  0 
to  t  whose  properties  could  be  discussed,  which  we  will  not  do  at  this  point.  In  the 
literature,  however,  it  can  be  looked  up  that  the  Ito  integral  and  the  Wiener  process 
share  the  continuity  and  the  martingale  properties.  What  is  more,  the  integration 
rules  outlined  at  the  end  of  Sect.  8.2  hold  true  for  Ito  integrals  as  well,  see  e.g. 
Klebaner  (2005,  Thm.  4.3). 


Diffusions 

For  economic  modeling  the  Ito  integral  is  an  important  ingredient.  However,  it  gains 
its  true  importance  only  when  combined  with  Riemann  integrals.  In  the  following 
chapters,  the  sum  of  both  integrals  constitutes  so-called  diffusions  (diffusion 
processes).  Hence,  we  now  define  processes  X(t)  (with  starting  value  X(0))  as 
follows: 


$$X(t) = X(0) + \int_0^t \mu(s)\,ds + \int_0^t \sigma(s)\,dW(s).$$

Frequently, we will write this integral equation in differential form as follows:

$$dX(t) = \mu(t)\,dt + \sigma(t)\,dW(t).$$

The conditions set to $\mu(s)$ and $\sigma(s)$ that guarantee the existence of such processes
can be adopted from Propositions 8.1 and 10.3. In general, $\mu(s)$ and $\sigma(s)$ are
stochastic; particularly, they are allowed to be dependent on $X(s)$ itself. Therefore,
we write $\mu(s)$ and $\sigma(s)$ as abbreviations for functions which firstly explicitly depend
on time and secondly depend on $X$ simultaneously:

$$\mu(s) = \mu(s, X(s)), \qquad \sigma(s) = \sigma(s, X(s)).$$

Processes $\mu$ and $\sigma$ satisfying these conditions are used to define diffusions $X(t)$:

$$dX(t) = \mu(t, X(t))\,dt + \sigma(t, X(t))\,dW(t), \quad t \in [0, T]. \tag{10.6}$$


3 The  name  stems  from  molecular  physics,  where  diffusions  are  used  to  model  the  change  of 
location  of  a  molecule  due  to  a  deterministic  component  (drift)  and  an  erratic  (stochastic) 
component.  Physically,  the  influence  of  temperature  on  the  motion  hides  behind  the  stochastics: 
The  higher  the  temperature  of  the  matter  in  which  the  particles  move,  the  more  erratic  is  their 
behavior. 
Recall that this differential equation actually means the following:

$$X(t) = X(0) + \int_0^t \mu(s, X(s))\,ds + \int_0^t \sigma(s, X(s))\,dW(s).$$

Example 10.3 (Brownian Motion with Drift) We consider the Brownian motion
with drift and a starting value 0:

$$X(t) = \mu\,t + \sigma\,W(t) = \mu \int_0^t ds + \sigma \int_0^t dW(s).$$

Therefore, the differential notation reads

$$dX(t) = \mu\,dt + \sigma\,dW(t).$$

Hence, this is a diffusion whose drift and volatility are constant:

$$\mu(t, X(t)) = \mu \quad \text{and} \quad \sigma(t, X(t)) = \sigma.$$ ■
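A diffusion of the form (10.6) can be simulated approximately by discretizing the differential notation. The following sketch uses the Euler-Maruyama scheme, which is not discussed in the text and is brought in here only as an illustration; the function name and its arguments are our own choices. For the constant coefficients of Example 10.3 the scheme reproduces $X(t) = \mu t + \sigma W(t)$ on the grid up to rounding.

```python
import math
import random


def euler_maruyama(mu, sigma, x0, t_end, n_steps, seed=0):
    """Approximate dX = mu(t, X) dt + sigma(t, X) dW on an equidistant grid."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    sd = math.sqrt(dt)
    x, w, t = x0, 0.0, 0.0
    xs, ws = [x], [w]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, sd)                     # Wiener increment
        x += mu(t, x) * dt + sigma(t, x) * dw       # drift part plus diffusion part
        w += dw
        t += dt
        xs.append(x)
        ws.append(w)
    return xs, ws


# Brownian motion with drift: constant mu = 0.5 and sigma = 2.0, X(0) = 0
xs, ws = euler_maruyama(lambda t, x: 0.5, lambda t, x: 2.0, 0.0, 1.0, 1000, seed=7)
print(f"X(1) = {xs[-1]:.4f},  mu*1 + sigma*W(1) = {0.5 + 2.0 * ws[-1]:.4f}")
```

For coefficients that depend on $X(t)$ the scheme is only an approximation whose quality depends on the step size.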


10.4 (Quadratic) Variation


From  (10.3)  we  know  that 


$$\int_0^t W(s)\,dW(s) = \frac{W^2(t) - t}{2}.$$


Now,  we  want  to  understand  where  the  expression  t  comes  from  that  is  subtracted 
from  W2(t).  It  will  be  made  clear  that  this  is  the  so-called  quadratic  variation. 


(Absolute)  Variation 

Again,  the  considerations  are  based  on  an  adequate  partition  of  the  interval  [0,  t], 

$$P_n([0, t]): \quad 0 = s_0 < s_1 < \ldots < s_n = t.$$

For a function $g$ the variation over this partition is defined as:

$$V_n(g, t) = \sum_{i=1}^n \left|g(s_i) - g(s_{i-1})\right|.$$

4 Sometimes  we  speak  of  absolute  variation  in  order  to  avoid  confusion  with  e.g.  quadratic 
variation. 
If the limit exists independently of the decomposition for $n \to \infty$ under (8.2), then
one says that $g$ is of finite variation and writes:

$$V_n(g, t) \to V(g, t), \quad n \to \infty.$$

The finite sum $V_n(g, t)$ measures for a certain partition the absolute increments of
the function $g$ on the interval $[0, t]$. If the function evolves sufficiently smoothly, then
$V(g, t)$ takes on a finite value for $n \to \infty$. For very jagged functions, however, it
may be that for an increasing refinement ($n \to \infty$) the sum of the absolute increments
of $g$ becomes larger and larger even for fixed $t$, such that $g$ is not of finite variation.

Example 10.4 (Monotonic Functions) For monotonic finite functions the variation
can be calculated very easily and intuitively. In this case, $V(g, t)$ is simply the
absolute value of the difference of the function at the endpoints of the interval,
$|g(t) - g(0)|$. First, let us assume that $g$ grows monotonically on $[0, t]$,

$$g(s_i) \geq g(s_{i-1}) \quad \text{for } s_i > s_{i-1}.$$

Obviously, it then holds by (8.3) that

$$V_n(g, t) = \sum_{i=1}^n \left(g(s_i) - g(s_{i-1})\right) = g(t) - g(0) = V(g, t).$$

For a monotonically decreasing function, it results quite analogously:

$$V_n(g, t) = \sum_{i=1}^n \left|g(s_i) - g(s_{i-1})\right| = -\sum_{i=1}^n \left(g(s_i) - g(s_{i-1})\right) = g(0) - g(t) = V(g, t).$$

Monotonic functions are hence of finite variation. ■

Without  the  requirement  of  monotonicity,  an  intuitive  sufficient  condition  exists 
for  the  function  to  be  smooth  enough  to  be  of  finite  variation  where  this  variation 
then  has  a  familiar  form  as  well. 


5 If $g$ is a deterministic function, then “$\to$” means the usual convergence of analysis. If we allow
for $g(t)$ to be a stochastic process, then we mean the convergence in mean square: “$\overset{2}{\to}$”.
Proposition 10.5 (Variation of Continuously Differentiable Functions) Let $g$ be
a continuously differentiable function with derivative $g'$ on $[0, t]$. Then $g$ is of finite
variation and it holds that

$$V(g, t) = \int_0^t \left|g'(s)\right| ds.$$

The  proof  is  given  in  Problem  10.6. 

Example 10.5 (Sine Wave) Let us consider a sine cycle of the frequency $k$ on the
interval $[0, 2\pi]$:

$$g_k(s) = \sin(ks), \quad k = 1, 2, \ldots.$$

The derivative reads

$$g_k'(s) = k\cos(ks).$$

Accounting for the sign one obtains as the variation:

$$V(g_1, 2\pi) = \int_0^{2\pi} |\cos(s)|\,ds = 4\int_0^{\pi/2} \cos(s)\,ds = 4\left(\sin\frac{\pi}{2} - \sin 0\right) = 4,$$

$$V(g_2, 2\pi) = \int_0^{2\pi} 2\,|\cos(2s)|\,ds = 8\int_0^{\pi/4} 2\cos(2s)\,ds = 8\left(\sin\frac{\pi}{2} - \sin 0\right) = 8,$$

$$V(g_k, 2\pi) = \int_0^{2\pi} k\,|\cos(ks)|\,ds = 4k\int_0^{\pi/(2k)} k\cos(ks)\,ds = 4k\left(\sin\frac{\pi}{2} - \sin 0\right) = 4k.$$

In Fig. 10.1 it can be observed how the sum of (absolute) differences in amplitude
grows with growing $k$. Accordingly, the absolute variation of $g_k(s) = \sin(ks)$
multiplies with $k$. For $k \to \infty$, $g_k'$ tends to infinity such that this derivative is not
continuous anymore. Consequently, the absolute variation is not finite in the limiting
case $k \to \infty$. ■

Fig. 10.1 Sine cycles of different frequencies (Example 10.5)
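The values $V(g_k, 2\pi) = 4k$ can be reproduced numerically by evaluating the finite sum $V_n(g, t)$ on a fine partition. This is a small sketch assuming an equidistant partition; the helper name `variation` is ours.

```python
import math


def variation(g, t, n):
    """Sum of absolute increments of g over an equidistant partition of [0, t]."""
    grid = [t * i / n for i in range(n + 1)]
    return sum(abs(g(grid[i]) - g(grid[i - 1])) for i in range(1, n + 1))


for k in (1, 2, 5):
    v = variation(lambda s, k=k: math.sin(k * s), 2 * math.pi, 100_000)
    print(f"k = {k}: V_n = {v:.4f}, 4k = {4 * k}")
```

For each $k$ the finite sum should be very close to $4k$, so the variation grows linearly in the frequency.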


Quadratic  Variation 

In the same way as $V_n(g, t)$ a $q$-variation can be defined, where we are only interested
in the case $q = 2$, the quadratic variation:

$$Q_n(g, t) = \sum_{i=1}^n \left|g(s_i) - g(s_{i-1})\right|^2 = \sum_{i=1}^n \left(g(s_i) - g(s_{i-1})\right)^2.$$

As would seem natural, $g$ is called of finite quadratic variation if it holds that

$$Q_n(g, t) \to Q(g, t), \quad n \to \infty.$$
If $g$ is a stochastic function, i.e. a stochastic process, then $Q(g, t)$ and $V(g, t)$ are
defined as limits in mean square. Between the absolute variation $V(g, t)$ and the
quadratic variation $Q(g, t)$ there are connections which we want to deal with now. If
a continuous function is of finite variation, then it is of finite quadratic variation
as well, where the latter is in fact zero. This is the statement of the following
proposition. As it seems counterintuitive at first sight that $Q$, as the limit of a positive
sum of squares $Q_n$, can become zero, we start with an example.

Example 10.6 (Identity Function) Let $id$ be the identity function on $[0, t]$:

$$id(s) = s.$$

As the function increases monotonically, it is of finite variation with

$$V(id, t) = id(t) - id(0) = t.$$

For finite $n$ it holds that:

$$Q_n(id, t) = \sum_{i=1}^n \left(id(s_i) - id(s_{i-1})\right)^2 = \sum_{i=1}^n (s_i - s_{i-1})^2 > 0.$$

$Q_n$ consists of $n$ terms, where the lengths $s_i - s_{i-1} > 0$ are of the magnitude $\frac{1}{n}$. Due
to the squaring, the $n$ terms are of the magnitude $\frac{1}{n^2}$. Hence, the sum converges to
zero for $n \to \infty$. This intuition can be formalized as follows:

$$Q_n(id, t) = \sum_{i=1}^n (s_i - s_{i-1})^2
\leq \max_{1 \leq i \leq n}(s_i - s_{i-1}) \sum_{i=1}^n (s_i - s_{i-1})
= \max_{1 \leq i \leq n}(s_i - s_{i-1})\,V_n(id, t)
= \max_{1 \leq i \leq n}(s_i - s_{i-1})\,t \to 0,$$

as $\max(s_i - s_{i-1}) \to 0$ for $n \to \infty$. ■
The next proposition gives a sufficient condition for the quadratic variation to
vanish.

Proposition 10.6 (Absolute and Quadratic Variation) Let $g$ be a continuous
function on $[0, t]$. It then holds under (8.2) for $n \to \infty$:

$$V_n(g, t) \to V(g, t) < \infty$$

implies

$$Q_n(g, t) \to 0.$$

If $g$ is a stochastic process, then “$\to$” is to be understood as convergence in mean
square.

The proof is given in Problem 10.7. From the proposition it follows by contraposition
that: If we have a positive (finite) quadratic variation, then the process does not
have a finite variation. Formally, we write: From

$$Q_n(g, t) \to Q(g, t) < \infty \quad \text{with } Q(g, t) > 0$$

it follows that there is no finite variation:

$$V_n(g, t) \to \infty.$$

If a function $g$ is so smooth that it has a continuous derivative, then $Q(g, t) = 0$ by
Propositions 10.5 and 10.6; the other way round, values of $Q(g, t) > 0$ characterize
how jagged, i.e. how far from smooth, the function is.


Wiener  Processes 

As  we  know,  the  WP  is  nowhere  differentiable,  therefore  it  is  everywhere  so  jagged 
that  there  is  no  valid  tangent  line  approximation.  Due  to  this  extreme  jaggedness  the 
WP  is  of  infinite  variation  as  well,  as  we  will  show  in  a  moment.  More  explicitly, 
we  prove  that  the  WP  is  of  positive  quadratic  variation  and  does  not  have  a  finite 
absolute  variation  due  to  Proposition  10.6.  We  save  the  proof  for  an  exercise 
(Problem  10.8). 
Proposition 10.7 (Quadratic Variation of the WP) For the Wiener process with
$n \to \infty$ it holds under (8.2):

$$Q_n(W, t) \overset{2}{\to} t = Q(W, t).$$

The expression $Q(W, t) = t$ characterizes the level of jaggedness or irregularity
of the Wiener process on the interval $[0, t]$. This non-vanishing quadratic variation
causes the problems and specifics of the Ito integral. Let us recapitulate: If the
Wiener process were continuously differentiable, then it would be of finite variation
due to Proposition 10.5 and it would have a vanishing quadratic variation due to
Proposition 10.6. However, this is just not the case.
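Both statements about the Wiener process can be observed on simulated increments. The sketch below assumes an equidistant partition and uses the hypothetical helper `wiener_variations`; it returns the absolute and the quadratic variation sums for a given number of partition points.

```python
import math
import random


def wiener_variations(t, n, seed=0):
    """Return (V_n, Q_n) of a simulated Wiener path on an equidistant partition of [0, t]."""
    rng = random.Random(seed)
    sd = math.sqrt(t / n)
    increments = [rng.gauss(0.0, sd) for _ in range(n)]
    v_n = sum(abs(dw) for dw in increments)   # absolute variation sum
    q_n = sum(dw * dw for dw in increments)   # quadratic variation sum
    return v_n, q_n


for n in (100, 10_000, 100_000):
    v_n, q_n = wiener_variations(1.0, n, seed=n)
    print(f"n = {n:>6}: V_n = {v_n:8.2f}, Q_n = {q_n:.4f}")
```

With growing $n$ the quadratic sum should settle near $t = 1$, while the absolute sum keeps growing (roughly proportionally to $\sqrt{n}$ on an equidistant grid).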


Symbolic  Notation 

In  finance  textbooks  one  frequently  finds  a  notation  for  time  that  is  strange  at  first 
sight: 

(dW(t))2  =  dt .  (10.7) 


How is this to be understood? Formal integration yields

$$\int_0^t (dW(s))^2 = t.$$

As would seem natural, the “integral” on the left-hand side here stands for $Q(W, t)$:

$$Q_n(W, t) = \sum_{i=1}^n \left(W(s_i) - W(s_{i-1})\right)^2 \overset{2}{\to} \int_0^t (dW(s))^2 := Q(W, t).$$

Therefore, the integral equation and hence (10.7) is justified by Proposition 10.7:
$Q(W, t) = t$. We adopt the result into the following proposition. The expressions

$$dW(t)\,dt = 0 \quad \text{and} \quad (dt)^2 = 0 \tag{10.8}$$

are to be understood similarly, namely in the sense of Proposition 10.8.
Proposition 10.8 (Symbolic Notation) It holds for $n \to \infty$ under (8.2):

(a) $\sum_{i=1}^n \left(W(s_i) - W(s_{i-1})\right)^2 \overset{2}{\to} \int_0^t (dW(s))^2 = t$,

(b) $\sum_{i=1}^n \left(W(s_i) - W(s_{i-1})\right)(s_i - s_{i-1}) \overset{2}{\to} \int_0^t dW(s)\,ds = 0$,

(c) $\sum_{i=1}^n (s_i - s_{i-1})^2 \to \int_0^t (ds)^2 = 0$.


In  symbols,  these  facts  are  frequently  formulated  as  in  (10.7)  and  (10.8). 


Note that the expression in (c) in Proposition 10.8 is the quadratic variation of the
identity function $id(s) = s$:

$$Q_n(id, t) \to \int_0^t (ds)^2 := Q(id, t) = 0.$$

Hence, the third claim is already established by Example 10.6. The expression from
(b) in Proposition 10.8 is sometimes also called covariation (of $W(s)$ and $id(s) = s$).
The claimed convergence to zero in mean square is shown in Problem 10.9.
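For intuition, the three limits of Proposition 10.8 can be written out for one concrete partition. The following worked example assumes the equidistant partition $s_i = i\,t/n$, which is our choice for illustration; the general case is treated in Example 10.6 and in the problems of this chapter.

```latex
% Assumption: equidistant partition s_i = i t / n, so that s_i - s_{i-1} = t/n.
\begin{align*}
\text{(c)}\quad & Q_n(id, t) = \sum_{i=1}^n \left(\frac{t}{n}\right)^2 = \frac{t^2}{n} \to 0, \\
\text{(b)}\quad & CV_n = \frac{t}{n} \sum_{i=1}^n \left(W(s_i) - W(s_{i-1})\right) = \frac{t}{n}\,W(t),
  \qquad \mathrm{E}\left(CV_n^2\right) = \frac{t^2}{n^2}\,t = \frac{t^3}{n^2} \to 0, \\
\text{(a)}\quad & \mathrm{E}\left(Q_n(W, t)\right) = n \cdot \frac{t}{n} = t,
  \qquad \mathrm{Var}\left(Q_n(W, t)\right) = 2\,n \left(\frac{t}{n}\right)^2 = \frac{2\,t^2}{n} \to 0.
\end{align*}
```

Here $CV_n$ denotes the sum in (b). In this special case the rates are explicit: the sums in (b) and (c) vanish, whereas the sum in (a) keeps its expectation $t$ while its variance shrinks.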


10.5 Problems and Solutions

Problems 

10.1 Prove Proposition 10.1 for $\gamma = 0$ (Ito integral).

Hint:  Use  Proposition  10.7. 

10.2  Prove  the  autocovariance  from  Proposition  10.2. 

10.3  Derive  that  the  Ito  integral  from  (10.3)  and  the  corresponding  Stratonovich 
integral  have  the  same  variance. 

10.4 Show for $S_n(W)$ from (10.1) with $s_i^*$ from Proposition 10.1,

$$s_i^* = (1 - \gamma)\,s_{i-1} + \gamma\,s_i, \quad 0 \leq \gamma \leq 1,$$

that it holds:

$$\mathrm{E}(S_n(W)) = \gamma\,t.$$


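10.5 Show: $\Gamma_n \overset{2}{\to} 0$ with

$$\Gamma_n = W\left((1 - \gamma)\,s_{i-1} + \gamma\,s_i\right) - \left[(1 - \gamma)\,W(s_{i-1}) + \gamma\,W(s_i)\right]$$

for $\gamma \in [0, 1]$ for an adequate partition, i.e. for $s_i - s_{i-1} \to 0$.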

10.6  Prove  Proposition  10.5. 

10.7  Prove  Proposition  10.6. 

10.8 Determine the quadratic variation of the Wiener process, i.e. verify
Proposition 10.7.

10.9  Show  (b)  from  Proposition  10.8. 

10.10 Determine the covariance of $W(s)$ and $\int_0^t W(r)\,dW(r)$ for $s \leq t$.

Solutions 

10.1 In order to prove Proposition 10.1 for $\gamma = 0$, it has to be shown that $I_n(W)$
from (10.2) converges in mean square, namely to the expression given in (10.3). For
this purpose, we write $I_n(W)$ as follows:

$$\begin{aligned}
I_n(W) &= \sum_{i=1}^n W(s_{i-1})\left(W(s_i) - W(s_{i-1})\right) \\
&= \frac{1}{2}\sum_{i=1}^n \left(2\,W(s_i)\,W(s_{i-1}) - 2\,W^2(s_{i-1})\right) \\
&= \frac{1}{2}\sum_{i=1}^n \left(W^2(s_i) - W^2(s_{i-1})\right)
 - \frac{1}{2}\sum_{i=1}^n \left(W^2(s_i) - 2\,W(s_i)\,W(s_{i-1}) + W^2(s_{i-1})\right),
\end{aligned}$$

where for the last equation we added zero, $(W^2(s_i) - W^2(s_i))$. Hence, it furthermore
follows by means of the quadratic variation:

$$\begin{aligned}
I_n(W) &= \frac{1}{2}\left[W^2(s_n) - W^2(s_0) - \sum_{i=1}^n \left(W(s_i) - W(s_{i-1})\right)^2\right] \\
&= \frac{1}{2}\left(W^2(t) - W^2(0)\right) - \frac{1}{2}\,Q_n(W, t).
\end{aligned}$$

The Wiener process is of finite quadratic variation and we know from
Proposition 10.7: $Q_n(W, t) \overset{2}{\to} t$. This verifies the claim (as it holds that $W(0) = 0$ with
probability 1).

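10.2 Based on (10.3) we consider the process

$$I(t) = \int_0^t W(s)\,dW(s) = \frac{W^2(t) - t}{2}.$$

Due to the vanishing expected value, it holds for the autocovariance that:

$$\begin{aligned}
\mathrm{E}(I(t)\,I(s)) &= \frac{1}{4}\,\mathrm{E}\left[W^2(t)\,W^2(s) - t\,W^2(s) - s\,W^2(t) + st\right] \\
&= \frac{1}{4}\left[\mathrm{E}\left(W^2(t)\,W^2(s)\right) - ts - st + st\right].
\end{aligned}$$

By adding zero one obtains:

$$\begin{aligned}
\mathrm{E}\left[W^2(t)\,W^2(s)\right] &= \mathrm{E}\left[(W(t) - W(s) + W(s))^2\,W^2(s)\right] \\
&= \mathrm{E}\left[(W(t) - W(s))^2\,W^2(s)\right] + 2\,\mathrm{E}\left[(W(t) - W(s))\,W^3(s)\right] + \mathrm{E}\left[W^4(s)\right].
\end{aligned}$$

If we assume w.l.o.g. that $s \leq t$, then due to the independence of non-overlapping
increments of $W$ it holds that:

$$\begin{aligned}
\mathrm{E}\left[(W(t) - W(s))^2\,W^2(s)\right] &= \mathrm{E}\left[(W(t) - W(s))^2\right]\mathrm{E}\left[W^2(s)\right] \\
&= \mathrm{Var}(W(t) - W(s))\,\mathrm{Var}(W(s)) \\
&= (t - s)\,s
\end{aligned}$$

and

$$\mathrm{E}\left[(W(t) - W(s))\,W^3(s)\right] = \mathrm{E}(W(t) - W(s))\,\mathrm{E}\left(W^3(s)\right) = 0.$$

As the kurtosis of a Gaussian random variable is 3, it follows that

$$\mathrm{E}\left[W^4(s)\right] = 3s^2.$$

Therefore, these results jointly yield the claimed outcome:

$$\mathrm{E}(I(t)\,I(s)) = \frac{(t - s)\,s}{4} + \frac{3}{4}\,s^2 - \frac{st}{4} = \frac{s^2}{2}, \quad s \leq t.$$

10.3 We know about the aforementioned Ito integral from Proposition 10.2 that its
variance is $t^2/2$. Hence, it is to be shown that:

$$\mathrm{Var}\left(\int_0^t W(s)\,\partial W(s)\right) = \frac{t^2}{2}.$$

We use Proposition 10.1,

$$\int_0^t W(s)\,\partial W(s) = \frac{W^2(t)}{2},$$

from which it follows immediately that

$$\mathrm{E}\left[\int_0^t W(s)\,\partial W(s)\right] = \frac{t}{2}.$$

The usual variance decomposition, see (2.1), hence yields

$$\mathrm{Var}\left(\int_0^t W(s)\,\partial W(s)\right) = \frac{1}{4}\left(\mathrm{E}\left(W^4(t)\right) - t^2\right).$$

Due to a kurtosis of 3, the fourth moment of a $\mathcal{N}(0, t)$-distribution just amounts to
$3t^2$. Thus, one obtains

$$\mathrm{Var}\left(\int_0^t W(s)\,\partial W(s)\right) = \frac{1}{4}\left(3t^2 - t^2\right) = \frac{t^2}{2},$$

which proves the claim.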
10.4 The expectation of the sum $S_n(W)$ is equal to the sum of the expectations.
Therefore, we consider an individual expectation,

$$\begin{aligned}
\mathrm{E}\left[W(s_i^*)\left(W(s_i) - W(s_{i-1})\right)\right] &= \mathrm{E}\left[W(s_i^*)\,W(s_i)\right] - \mathrm{E}\left[W(s_i^*)\,W(s_{i-1})\right] \\
&= \min(s_i^*, s_i) - \min(s_i^*, s_{i-1}) \\
&= (1 - \gamma)\,s_{i-1} + \gamma\,s_i - s_{i-1} \\
&= \gamma\,(s_i - s_{i-1}),
\end{aligned}$$

where simply the well-known covariance formula was used. Hence, summation
yields as desired

$$\mathrm{E}(S_n(W)) = \gamma \sum_{i=1}^n (s_i - s_{i-1}) = \gamma\,(s_n - s_0) = \gamma\,(t - 0).$$

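10.5 Convergence in mean square implies that the mean squared error tends to zero.
The MSE with the limit zero reads $\mathrm{MSE}(\Gamma_n, 0) = \mathrm{E}(\Gamma_n^2)$. Therefore, it remains to
be shown that: $\mathrm{E}(\Gamma_n^2) \to 0$.

For this purpose one considers with $s_i^* = (1 - \gamma)\,s_{i-1} + \gamma\,s_i$:

$$\begin{aligned}
\Gamma_n^2 = {} & W^2(s_i^*) - 2\,W(s_i^*)\left[(1 - \gamma)\,W(s_{i-1}) + \gamma\,W(s_i)\right] \\
& + (1 - \gamma)^2\,W^2(s_{i-1}) + 2\gamma(1 - \gamma)\,W(s_{i-1})\,W(s_i) + \gamma^2\,W^2(s_i).
\end{aligned}$$

Forming expectation yields:

$$\begin{aligned}
\mathrm{E}(\Gamma_n^2) &= s_i^* - 2\left[(1 - \gamma)\,s_{i-1} + \gamma\,s_i^*\right] + (1 - \gamma)^2 s_{i-1} + 2\gamma(1 - \gamma)\,s_{i-1} + \gamma^2 s_i \\
&= (1 - \gamma)\,s_{i-1} + \gamma\,s_i - 2(1 - \gamma)\,s_{i-1} - 2\gamma(1 - \gamma)\,s_{i-1} - 2\gamma^2 s_i \\
&\quad + (1 - \gamma)^2 s_{i-1} + 2\gamma(1 - \gamma)\,s_{i-1} + \gamma^2 s_i \\
&= s_i\,(\gamma - \gamma^2) + s_{i-1}\left((1 - \gamma)^2 - (1 - \gamma)\right) \\
&= s_i\,(\gamma - \gamma^2) + s_{i-1}\,(\gamma^2 - \gamma) \\
&= (s_i - s_{i-1})\,\gamma\,(1 - \gamma).
\end{aligned}$$

Hence, for $n \to \infty$ the required result is obtained as $s_i - s_{i-1}$ tends to zero.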
10.6 For a given partition we write

$$V_n(g, t) = \sum_{i=1}^n \left|g(s_i) - g(s_{i-1})\right| = \sum_{i=1}^n \left|\int_{s_{i-1}}^{s_i} g'(s)\,ds\right|.$$

As the derivative is continuous, $|g'(s)|$ is continuous as well and hence integrable.
According to the mean value theorem an $s_i^* \in [s_{i-1}, s_i]$ exists with

$$\int_{s_{i-1}}^{s_i} g'(s)\,ds = g'(s_i^*)\,(s_i - s_{i-1}).$$

Thus it follows that

$$V_n(g, t) = \sum_{i=1}^n \left|g'(s_i^*)\right| (s_i - s_{i-1}) \to \int_0^t \left|g'(s)\right| ds.$$

Quod erat demonstrandum.

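10.7 The claim is based on the bound

$$Q_n(g, t) \leq \max_{1 \leq i \leq n}\left(\left|g(s_i) - g(s_{i-1})\right|\right) \sum_{i=1}^n \left|g(s_i) - g(s_{i-1})\right|
= \max_{1 \leq i \leq n}\left(\left|g(s_i) - g(s_{i-1})\right|\right) V_n(g, t).$$

Due to continuity it holds that

$$\max_{1 \leq i \leq n}\left(\left|g(s_i) - g(s_{i-1})\right|\right) \to 0.$$

Hence, the claim immediately follows from the bound.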

10.8 It is to be shown that the mean squared error,

$$\mathrm{MSE}(Q_n(W, t), t) = \mathrm{E}\left[(Q_n(W, t) - t)^2\right],$$

tends to zero. For this purpose we proceed in two steps. In the first one we show
that the MSE coincides with the variance of $Q_n(W, t)$. In the second step it will be
shown that the variance converges to zero.

(1) For the first step we only need to derive $\mathrm{E}(Q_n(W, t)) = t$. With

$$Q_n(W, t) = \sum_{i=1}^n \left(W(s_i) - W(s_{i-1})\right)^2$$

the required expectation can be easily determined:

$$\mathrm{E}(Q_n(W, t)) = \sum_{i=1}^n \mathrm{Var}\left(W(s_i) - W(s_{i-1})\right) = \sum_{i=1}^n (s_i - s_{i-1}) = s_n - s_0 = t - 0 = t.$$

(2) Due to the independence of the increments of the WP one has

$$\mathrm{Var}(Q_n(W, t)) = \sum_{i=1}^n \mathrm{Var}\left[\left(W(s_i) - W(s_{i-1})\right)^2\right].$$

Due to $W(s_i) - W(s_{i-1}) \sim \mathcal{N}(0, s_i - s_{i-1})$ and with a kurtosis of 3 for Gaussian
random variables, it furthermore holds that:

$$\begin{aligned}
\mathrm{Var}\left[\left(W(s_i) - W(s_{i-1})\right)^2\right] &= \mathrm{E}\left[\left(W(s_i) - W(s_{i-1})\right)^4\right] - \left(\mathrm{E}\left[\left(W(s_i) - W(s_{i-1})\right)^2\right]\right)^2 \\
&= 3\left[\mathrm{Var}\left(W(s_i) - W(s_{i-1})\right)\right]^2 - (s_i - s_{i-1})^2 \\
&= 2\,(s_i - s_{i-1})^2.
\end{aligned}$$

Hence, plugging in yields

$$\begin{aligned}
\mathrm{Var}(Q_n(W, t)) &= 2\sum_{i=1}^n (s_i - s_{i-1})^2 \\
&\leq 2\max_{1 \leq i \leq n}(s_i - s_{i-1}) \sum_{i=1}^n (s_i - s_{i-1}) \\
&= 2\max_{1 \leq i \leq n}(s_i - s_{i-1})\,(s_n - s_0) \\
&\to 0, \quad n \to \infty,
\end{aligned}$$

which completes the proof.
10.9 Let us call the aforementioned covariation $CV_n$:

$$CV_n = \sum_{i=1}^n \left(W(s_i) - W(s_{i-1})\right)(s_i - s_{i-1}).$$

The claim reads: $\mathrm{MSE}(CV_n, 0) \to 0$. As it obviously holds that $\mathrm{E}(CV_n) = 0$, we
obtain

$$\mathrm{MSE}(CV_n, 0) = \mathrm{Var}(CV_n).$$

Hence, it remains to be shown that this variance tends to zero: Due to the
independence of the increments of the WP, one determines

$$\mathrm{Var}(CV_n) = \sum_{i=1}^n \mathrm{Var}\left(W(s_i) - W(s_{i-1})\right)(s_i - s_{i-1})^2,$$

and hence

$$\begin{aligned}
\mathrm{Var}(CV_n) &= \sum_{i=1}^n (s_i - s_{i-1})^3 \\
&\leq \max_{1 \leq i \leq n}(s_i - s_{i-1}) \sum_{i=1}^n (s_i - s_{i-1})^2 \\
&= \max_{1 \leq i \leq n}(s_i - s_{i-1})\,Q_n(id, t) \\
&\to 0,
\end{aligned}$$

where $Q_n(id, t)$ is the quadratic variation of the identity function, see Example 10.6.
Hence, the claim is established.

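10.10 Since both $W(s)$ and the Ito integral have expectation zero, the covariance
equals the expected value of $Y(s, t)$ with

$$Y(s, t) := W(s)\int_0^t W(r)\,dW(r), \quad s \leq t.$$

Due to (10.3) it again holds that:

$$\mathrm{E}(Y(s, t)) = \mathrm{E}\left[W(s)\,\frac{W^2(t) - t}{2}\right] = \frac{1}{2}\,\mathrm{E}\left[W(s)\,W^2(t)\right].$$

Therefore, we study

$$\begin{aligned}
\mathrm{E}\left[W(s)\,W^2(t)\right] &= \mathrm{E}\left[W(s)\left(W(t) - W(s) + W(s)\right)^2\right] \\
&= \mathrm{E}\left[W(s)\left(W(t) - W(s)\right)^2\right] + 2\,\mathrm{E}\left[W^2(s)\left(W(t) - W(s)\right)\right] + \mathrm{E}\left[W^3(s)\right].
\end{aligned}$$

Let us consider the last three terms one by one. Due to the independence of the
increments, one obtains:

$$\mathrm{E}\left[W(s)\left(W(t) - W(s)\right)^2\right] = \mathrm{E}(W(s))\,\mathrm{E}\left(\left(W(t) - W(s)\right)^2\right) = 0.$$

Moreover, the second term is also zero by the same independence argument, as
$\mathrm{E}(W(t) - W(s)) = 0$. For the third term the
symmetry of the Gaussian distribution yields $\mathrm{E}(W^3(s)) = 0$. Summing up, we have
shown that

$$\mathrm{E}\left(W(s)\,W^2(t)\right) = 0, \quad s \leq t,$$

and hence

$$\mathrm{E}\left[W(s)\int_0^t W(r)\,dW(r)\right] = 0, \quad s \leq t.$$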
Ito's Lemma


11.1  Summary 

If  a  process  is  given  as  a  stochastic  Riemann  and/or  Ito  integral,  then  one  may 
wish  to  determine  how  a  function  of  the  process  looks.  This  is  achieved  by  Ito’s 
lemma  as  an  ingredient  of  stochastic  calculus.  In  particular,  stochastic  integrals  can 
be  determined  and  stochastic  differential  equations  can  be  solved  with  it;  we  will 
get  to  know  stochastic  variants  of  familiar  rules  of  differentiation  (chain  and  product 
rule).  For  this  purpose  we  approach  Ito’s  lemma  step  by  step  by  first  discussing  it 
for  Wiener  processes,  then  by  generalizing  it  for  diffusion  processes  and  finally  by 
considering  some  extensions. 


11.2 The Univariate Case

The WP itself is a special case of a diffusion as defined in (10.6). With

$$\mu(t, W(t)) = 0 \quad \text{and} \quad \sigma(t, W(t)) = 1$$

Eq. (10.6) becomes (with probability one)

$$W(t) = \int_0^t dW(s).$$

Thus, we consider this special case first.

For  Wiener  Processes 

As a revision, let us recall (10.3), which can be written equivalently as

$$2\int_0^t W(s)\,dW(s) = W^2(t) - t.$$

If $g(W) = W^2$ is defined with derivatives $g'(W) = 2W$ and $g''(W) = 2$, then this
equation can also be formulated as follows:

$$\int_0^t g'(W(s))\,dW(s) = g(W(t)) - t = g(W(t)) - \frac{1}{2}\int_0^t g''(W(s))\,ds.$$

Now,  this  is  just  the  form  of  Ito’s  lemma  for  functions  g  of  a  Wiener  process.  It  is  a 
corollary  of  the  more  general  case  (Proposition  11.1)  which  will  be  covered  in  the 
following.  Throughout,  we  will  assume  that  g  has  a  continuous  second  derivative 
(“twice  continuously  differentiable”). 

Corollary 11.1 (Ito’s Lemma for WP) Let $g: \mathbb{R} \to \mathbb{R}$ be twice continuously
differentiable. Then it holds that

$$dg(W(t)) = g'(W(t))\,dW(t) + \frac{1}{2}\,g''(W(t))\,dt.$$

In integral form this corollary to Ito’s lemma is to be read as follows:

$$g(W(t)) = g(W(0)) + \int_0^t g'(W(s))\,dW(s) + \frac{1}{2}\int_0^t g''(W(s))\,ds.$$

Strictly speaking, this integral equation is the statement of the corollary, which
is abbreviated by the differential notation. However, in doing so it must not be
forgotten that the WP is not differentiable. Sometimes one also writes even more
briefly:

$$dg(W) = g'(W)\,dW + \frac{1}{2}\,g''(W)\,dt.$$


Example 11.1 (Powers of the WP) For $g(W) = \frac{1}{2}W^2$ this special case of Ito’s
lemma just proves (10.3). In general, one obtains for $m \geq 2$ from Corollary 11.1
with $g(W) = \frac{W^m}{m}$:

$$d\left(\frac{W^m(t)}{m}\right) = W^{m-1}(t)\,dW(t) + \frac{m - 1}{2}\,W^{m-2}(t)\,dt,$$

or in integral notation

$$W^m(t) = m\int_0^t W^{m-1}(s)\,dW(s) + \frac{m\,(m - 1)}{2}\int_0^t W^{m-2}(s)\,ds. \quad ■$$

Explanation  and  Proof 

Corollary 11.1 can be considered as a stochastic chain rule and can loosely be formulated as follows: the derivative of $g(W(t))$ results as the product of the outer derivative ($g'(W)$) and the inner derivative ($dW$), plus an Ito-specific extra term consisting of the second derivative of $g$ times $\frac{1}{2}\,dt$. We now clarify where this term comes from (a second order Taylor series expansion) and why no further terms (higher order derivatives) occur. For this purpose we prove Corollary 11.1 (almost completely) although it is, as mentioned above, a corollary to Proposition 11.1.

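With $s_n = t$ and $s_0 = 0$ it holds due to (8.3) that:

$$g(W(t)) = g(W(0)) + \sum_{i=1}^{n} \left( g(W(s_i)) - g(W(s_{i-1})) \right) .$$

Now, on the right-hand side a second order Taylor expansion of $g(W(s_i))$ about $W(s_{i-1})$ yields

$$g(W(s_i)) = g(W(s_{i-1})) + g'(W(s_{i-1})) \left( W(s_i) - W(s_{i-1}) \right) + \frac{g''(\theta_i)}{2} \left( W(s_i) - W(s_{i-1}) \right)^2 ,$$

with $\theta_i$ between $W(s_{i-1})$ and $W(s_i)$:

$$|\theta_i - W(s_{i-1})| \in \left( 0, |W(s_i) - W(s_{i-1})| \right) .$$

By substitution of $g(W(s_i)) - g(W(s_{i-1}))$, $g(W(t)) - g(W(0))$ can be expressed by two sums:

$$g(W(t)) - g(W(0)) = \Sigma_1 + \Sigma_2$$

with

$$\Sigma_1 = \sum_{i=1}^{n} g'(W(s_{i-1})) \left( W(s_i) - W(s_{i-1}) \right) ,$$

$$\Sigma_2 = \frac{1}{2} \sum_{i=1}^{n} g''(\theta_i) \left( W(s_i) - W(s_{i-1}) \right)^2 .$$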
Now, $\Sigma_1$ just coincides with the Ito sum from (10.5) such that it holds due to Proposition 10.3 that:

$$\Sigma_1 \overset{2}{\to} \int_0^t g'(W(s))\,dW(s) .$$

Furthermore, we know from the section on quadratic variation (Proposition 10.8)

$$(dW(s))^2 = ds .$$

As the quadratic variation of the WP is not negligible (Proposition 10.7), this suggests the following approximation:

$$\Sigma_2 \approx \frac{1}{2} \int_0^t g''(W(s))\,(dW(s))^2 = \frac{1}{2} \int_0^t g''(W(s))\,ds .$$

A  corresponding  convergence  in  mean  square  can  actually  be  established,  which 
we  will  dispense  with  at  this  point.  Hence,  except  for  this  technical  detail, 
Corollary  11.1  is  verified. 

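Additionally, we want to consider why higher order derivatives do not matter for Ito's lemma. For a third order Taylor expansion, for example, it follows that

$$g(W(s_i)) - g(W(s_{i-1})) = g'(W(s_{i-1})) \left( W(s_i) - W(s_{i-1}) \right) + \frac{g''(W(s_{i-1}))}{2} \left( W(s_i) - W(s_{i-1}) \right)^2 + \frac{g'''(\theta_i)}{6} \left( W(s_i) - W(s_{i-1}) \right)^3 .$$

Thus, due to the summation, the term

$$\Sigma_3 = \frac{1}{6} \sum_{i=1}^{n} g'''(\theta_i) \left( W(s_i) - W(s_{i-1}) \right)^3$$

occurs. However, it is negligible:

$$|\Sigma_3| \le \frac{1}{6} \sum_{i=1}^{n} |g'''(\theta_i)|\,|W(s_i) - W(s_{i-1})|^3 \le \frac{1}{6} \max_{1 \le i \le n} \left\{ |g'''(\theta_i)|\,|W(s_i) - W(s_{i-1})| \right\} \cdot Q_n(W, t) \to 0 \cdot t = 0 ,$$

as the quadratic variation of the WP tends to $t$ and as it furthermore holds that

$$\mathrm{MSE}\left[ W(s_i) - W(s_{i-1}), 0 \right] = \mathrm{Var}\left( W(s_i) - W(s_{i-1}) \right) = s_i - s_{i-1} \to 0 .$$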
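The claim that third order terms vanish can be illustrated by simulation. The sketch below is an assumption-laden illustration (equidistant grid, hypothetical function name and seed): it sums the absolute cubed increments of a simulated WP on $[0, 1]$ for a coarse and a fine partition. In expectation this sum shrinks at rate $1/\sqrt{n}$, whereas the sum of squared increments stays near $t$.

```python
import math
import random


def abs_cubic_variation(n, t=1.0, seed=7):
    """Sum of |W(s_i) - W(s_{i-1})|^3 over an equidistant partition of [0, t]."""
    rng = random.Random(seed)
    sd = math.sqrt(t / n)   # standard deviation of one increment
    return sum(abs(rng.gauss(0.0, sd)) ** 3 for _ in range(n))


coarse = abs_cubic_variation(400)      # expected value roughly 1.6 / sqrt(400)
fine = abs_cubic_variation(40_000)     # expected value roughly 1.6 / sqrt(40000)
print(coarse, fine)
```

The constant 1.6 in the comments is the third absolute moment of a standard normal variable, $2\sqrt{2/\pi} \approx 1.596$.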


For  Diffusions 

Now,  we  turn  to  Ito’s  lemma  for  diffusions.  In  this  section,  we  consider  the 
univariate  case  of  only  one  diffusion  that  depends  on  one  WP  only.  The  following 
variant  of  Ito’s  lemma  is  again  a  kind  of  stochastic  chain  rule  and  the  idea  for  the 
proof  is  again  based  on  a  second  order  Taylor  expansion. 

Proposition 11.1 (Ito's Lemma with One Dependent Variable) Let $g: \mathbb{R} \to \mathbb{R}$ be twice continuously differentiable and $X(t)$ a diffusion on $[0, T]$ with (10.6), or briefly:

$$dX(t) = \mu(t)\,dt + \sigma(t)\,dW(t) .$$

Then it holds that

$$dg(X(t)) = g'(X(t))\,dX(t) + \frac{1}{2}\,g''(X(t))\,\sigma^2(t)\,dt .$$

If $X(t) = W(t)$ is a Wiener process, i.e. $\mu(t) = 0$ and $\sigma(t) = 1$, then Corollary 11.1 is obtained as a special case.

The statement in Proposition 11.1 is given somewhat succinctly. It can be condensed even more by suppressing the dependence on time:

$$dg(X) = g'(X)\,dX + \frac{1}{2}\,g''(X)\,\sigma^2\,dt .$$

However, it needs to be clear that by substituting $dX(t)$ one obtains for the differential $dg(X(t))$ the following lengthy expression:

$$\left[ g'(X(t))\,\mu(t, X(t)) + \frac{1}{2}\,g''(X(t))\,\sigma^2(t, X(t)) \right] dt + g'(X(t))\,\sigma(t, X(t))\,dW(t) .$$

The corresponding statement in integral notation naturally looks yet more extensive.

Example 11.2 (Differential of the Exponential Function) Let a diffusion $X(t)$ be given,

$$dX(t) = \mu(t)\,dt + \sigma(t)\,dW(t) .$$

Then, how does the differential of $e^{X(t)}$ read? This example is particularly easy to calculate as it holds for $g(x) = e^x$ that:

$$g''(x) = g'(x) = g(x) = e^x .$$

Hence Ito's lemma yields:

$$de^{X(t)} = e^{X(t)}\,dX(t) + \frac{e^{X(t)}}{2}\,\sigma^2(t)\,dt = e^{X(t)} \left( \mu(t) + \frac{\sigma^2(t)}{2} \right) dt + e^{X(t)}\,\sigma(t)\,dW(t) .$$

If $X(t)$ is deterministic, i.e. $\sigma(t) = 0$, then it results

$$\frac{de^{X(t)}}{dt} = e^{X(t)}\,\frac{dX(t)}{dt} ,$$

which just corresponds to the traditional chain rule (outer derivative times inner derivative). ■
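As a worked illustration of Example 11.2, consider the special case $X(t) = W(t)$, i.e. $\mu(t) = 0$ and $\sigma(t) = 1$. The sketch below additionally assumes that the Ito integral involved has expectation zero (a standard property under square integrability, not restated in this section); the symbol $m(t)$ is introduced here only for illustration.

```latex
% Special case of Example 11.2: X(t) = W(t), i.e. \mu(t) = 0 and \sigma(t) = 1.
\begin{align*}
  d\,e^{W(t)} &= e^{W(t)}\,dW(t) + \tfrac{1}{2}\,e^{W(t)}\,dt , \\
  e^{W(t)} &= 1 + \int_0^t e^{W(s)}\,dW(s) + \tfrac{1}{2}\int_0^t e^{W(s)}\,ds . \\
\intertext{Assuming the Ito integral has expectation zero, $m(t) = \mathrm{E}\,e^{W(t)}$ satisfies}
  m(t) &= 1 + \tfrac{1}{2}\int_0^t m(s)\,ds
  \quad\Longrightarrow\quad m(t) = e^{t/2} .
\end{align*}
```

The result agrees with the mean of a log-normal variable whose logarithm is $N(0, t)$, so the extra term $\frac{1}{2} e^{W(t)} dt$ is exactly what accounts for the growth of the expectation.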


On  the  Proof 

Just as for the proof of Corollary 11.1, one obtains with $\theta_i$, where

$$|\theta_i - X(s_{i-1})| \in \left( 0, |X(s_i) - X(s_{i-1})| \right) ,$$

from the Taylor expansion:

$$g(X(t)) - g(X(0)) = \Sigma_1 + \Sigma_2 ,$$

$$\Sigma_1 = \sum_{i=1}^{n} g'(X(s_{i-1})) \left( X(s_i) - X(s_{i-1}) \right) ,$$

$$\Sigma_2 = \frac{1}{2} \sum_{i=1}^{n} g''(\theta_i) \left( X(s_i) - X(s_{i-1}) \right)^2 .$$

The first sum is approximated as desired:

$$\Sigma_1 \approx \int_0^t g'(X(s))\,dX(s) .$$

The second sum is approximated by

$$\Sigma_2 \approx \frac{1}{2} \int_0^t g''(X(s))\,(dX(s))^2 .$$

By multiplying out the square of the differential of the Ito process,

$$(dX(s))^2 = \mu^2(s)\,(ds)^2 + 2\,\mu(s)\,\sigma(s)\,dW(s)\,ds + \sigma^2(s)\,(dW(s))^2 ,$$

one shows due to (cf. Proposition 10.8),

$$(ds)^2 = 0 , \quad dW(s)\,ds = 0 , \quad (dW(s))^2 = ds ,$$

for the second sum:

$$\Sigma_2 \approx \frac{1}{2} \int_0^t g''(X(s))\,\sigma^2(s)\,ds .$$

This  verifies  Proposition  11.1  at  least  heuristically. 


11.3 Bivariate Diffusions with One WP

A generalization of Proposition 11.1, which is sometimes needed, is presented by the following variant of Ito's lemma. Let the function $g$ depend on two diffusions $X_1$ and $X_2$, both driven by the very same Wiener process. Occasionally, we will call this case (referring to the literature on interest rate models) the one-factor case, as it is the identical factor $W(t)$ driving both diffusions.


One-Factor  Case 

Let $g$ be a function in two arguments, whose partial derivatives are denoted by

$$\frac{\partial g(X_1, X_2)}{\partial X_i} \quad \text{and} \quad \frac{\partial^2 g(X_1, X_2)}{\partial X_i\,\partial X_j} .$$

Then, the following proposition is a special case of Proposition 11.3.

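Proposition 11.2 (Ito's Lemma with Two Dependent Variables) Let $g: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be twice continuously differentiable with respect to both arguments, and let $X_i(t)$ be diffusions on $[0, T]$ with the same WP:

$$dX_i(t) = \mu_i(t)\,dt + \sigma_i(t)\,dW(t) , \quad i = 1, 2.$$

Then it holds that

$$dg(X_1(t), X_2(t)) = \frac{\partial g(X_1(t), X_2(t))}{\partial X_1}\,dX_1(t) + \frac{\partial g(X_1(t), X_2(t))}{\partial X_2}\,dX_2(t)$$
$$\qquad + \frac{1}{2} \left[ \frac{\partial^2 g(X_1(t), X_2(t))}{\partial X_1^2}\,\sigma_1^2(t) + \frac{\partial^2 g(X_1(t), X_2(t))}{\partial X_2^2}\,\sigma_2^2(t) \right] dt$$
$$\qquad + \frac{\partial^2 g(X_1(t), X_2(t))}{\partial X_1\,\partial X_2}\,\sigma_1(t)\,\sigma_2(t)\,dt .$$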


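Note that substitution of $dX_i(t)$ in Proposition 11.2 leads again to an integral equation for the process $g(X_1(t), X_2(t))$ including Riemann and Ito integrals.

Frequently, the time dependence of the processes will be suppressed in order to obtain a more economical formulation of Proposition 11.2:

$$dg(X_1, X_2) = \frac{\partial g(X_1, X_2)}{\partial X_1}\,dX_1 + \frac{\partial g(X_1, X_2)}{\partial X_2}\,dX_2 + \frac{1}{2} \left[ \frac{\partial^2 g(X_1, X_2)}{\partial X_1^2}\,\sigma_1^2 + \frac{\partial^2 g(X_1, X_2)}{\partial X_2^2}\,\sigma_2^2 \right] dt + \frac{\partial^2 g(X_1, X_2)}{\partial X_1\,\partial X_2}\,\sigma_1\,\sigma_2\,dt .$$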


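By this notation one recognizes that again a second order Taylor expansion hides behind Proposition 11.2, but now of the two-dimensional function $g$:

$$dg(X_1, X_2) = \frac{\partial g(X_1, X_2)}{\partial X_1}\,dX_1 + \frac{\partial g(X_1, X_2)}{\partial X_2}\,dX_2$$
$$\qquad + \frac{1}{2} \left[ \frac{\partial^2 g(X_1, X_2)}{\partial X_1^2}\,(dX_1)^2 + \frac{\partial^2 g(X_1, X_2)}{\partial X_2^2}\,(dX_2)^2 \right]$$
$$\qquad + \frac{1}{2} \left[ \frac{\partial^2 g(X_1, X_2)}{\partial X_1\,\partial X_2}\,dX_1\,dX_2 + \frac{\partial^2 g(X_1, X_2)}{\partial X_2\,\partial X_1}\,dX_2\,dX_1 \right] ,$$

where the two mixed terms can be combined because the mixed second derivatives coincide due to the continuity assumed. With (10.7) and (10.8) it can easily be shown that (we again suppress the arguments)

$$(dX_i)^2 = \mu_i^2\,(dt)^2 + 2\,\mu_i\,\sigma_i\,dt\,dW + \sigma_i^2\,(dW)^2 = 0 + 0 + \sigma_i^2\,dt ,$$

and for the covariance expression as well

$$dX_1\,dX_2 = \sigma_1\,\sigma_2\,dt . \tag{11.1}$$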
Example 11.3 (One-Factor Product Rule) Proposition 11.2 provides us with a stochastic product rule for $X_1(t)\,X_2(t)$:

$$d(X_1(t)\,X_2(t)) = X_2(t)\,dX_1(t) + X_1(t)\,dX_2(t) + \sigma_1(t)\,\sigma_2(t)\,dt . \tag{11.2}$$

Under $\sigma_1(t) = 0$ or $\sigma_2(t) = 0$ (no stochastics), the well-known product rule is just reproduced. The derivation of (11.2) follows for $g(x_1, x_2) = x_1 x_2$ with

$$\frac{\partial g}{\partial x_1} = x_2 , \quad \frac{\partial g}{\partial x_2} = x_1 , \quad \frac{\partial^2 g}{\partial x_1^2} = 0 , \quad \frac{\partial^2 g}{\partial x_2^2} = 0$$

and

$$\frac{\partial^2 g}{\partial x_1\,\partial x_2} = \frac{\partial^2 g}{\partial x_2\,\partial x_1} = 1 .$$

Hence, we obtain an abbreviated form:

$$d(X_1 X_2) = \frac{\partial g(X_1, X_2)}{\partial x_1}\,dX_1 + \frac{\partial g(X_1, X_2)}{\partial x_2}\,dX_2 + \sigma_1\,\sigma_2\,dt ,$$

where the second derivatives were plugged in. If one substitutes the first derivatives, then one obtains the result from (11.2). ■
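The one-factor product rule can be illustrated on a grid. The following sketch assumes the two illustrative diffusions $dX_1 = dW$ and $dX_2 = dt + 2\,dW$ (so $\sigma_1 \sigma_2 = 2$), both started at zero; these choices, the function name and the seed are hypothetical. In discrete time the product telescopes exactly into the two "classical" sums plus the sum of products of increments, and the latter should be close to $\sigma_1 \sigma_2\, t = 2t$.

```python
import math
import random


def one_factor_product(n=100_000, t=1.0, seed=3):
    """Discrete product rule for X1 = W and X2(t) = t + 2 W(t), driven by the same WP."""
    rng = random.Random(seed)
    h = t / n
    sd = math.sqrt(h)
    x1, x2 = 0.0, 0.0
    classical = 0.0   # sum of X2 dX1 + X1 dX2, evaluated at left endpoints
    cross = 0.0       # sum of dX1 * dX2
    for _ in range(n):
        dw = rng.gauss(0.0, sd)
        dx1 = dw               # dX1 = dW, so sigma_1 = 1
        dx2 = h + 2.0 * dw     # dX2 = dt + 2 dW, so sigma_2 = 2
        classical += x2 * dx1 + x1 * dx2
        cross += dx1 * dx2
        x1 += dx1
        x2 += dx2
    return x1 * x2, classical, cross


product, classical, cross = one_factor_product()
print(product - classical - cross)   # zero up to rounding error
print(cross)                         # close to sigma_1 * sigma_2 * t = 2
```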


Time  as  a  Dependent  Variable 

Frequently it is of interest to consider another special case of Proposition 11.2. Again, $g$ is a function in two arguments; however, the first one is time $t$, and the second one is a diffusion $X(t)$:

$$g: [0, T] \times \mathbb{R} \to \mathbb{R} , \qquad (t, X) \mapsto g(t, X) .$$

We consciously suppress the fact that the diffusion is time-dependent as well: when we talk about the derivative of $g$ with respect to time, we refer strictly formally to the partial derivative with respect to the first argument. Sometimes this is confusing for beginners. For example, for

$$g(t, X(t)) = g(t, X) = t\,X(t)$$

the derivative with respect to $t$ refers to:

$$\frac{\partial g(t, X(t))}{\partial t} = X(t) .$$

Hence, for the partial derivatives we purposely do not consider that $X$ itself is a function of $t$.

With $X_1(t) = t$ and $X(t) = X_2(t)$ we obtain from Proposition 11.2 for $\mu_1(t) = 1$ and $\sigma_1(t) = 0$ the following result.


Corollary 11.2 (Ito's Lemma with Time as a Dependent Variable) Let $g: [0, T] \times \mathbb{R} \to \mathbb{R}$ be twice continuously differentiable with respect to both arguments and let $X(t)$ be a diffusion on $[0, T]$ with (10.6), or briefly

$$dX(t) = \mu(t)\,dt + \sigma(t)\,dW(t) .$$

Then it holds that

$$dg(t, X(t)) = \frac{\partial g(t, X(t))}{\partial t}\,dt + \frac{\partial g(t, X(t))}{\partial X}\,dX(t) + \frac{1}{2}\,\frac{\partial^2 g(t, X(t))}{\partial X^2}\,\sigma^2(t)\,dt .$$

Again, suppressing time-dependence, this can be condensed to

$$dg(t, X) = \frac{\partial g(t, X)}{\partial t}\,dt + \frac{\partial g(t, X)}{\partial X}\,dX + \frac{1}{2}\,\frac{\partial^2 g(t, X)}{\partial X^2}\,\sigma^2\,dt .$$

Example 11.4 (OUP as a Diffusion) As an application we can now prove that the standard Ornstein-Uhlenbeck process $X_c(t)$ from (9.3) is a diffusion with

$$\mu(t, X_c(t)) = c\,X_c(t) \quad \text{and} \quad \sigma(t, X_c(t)) = 1 .$$

For this purpose we define as an auxiliary quantity the process

$$X(t) = \int_0^t e^{-cs}\,dW(s) ,$$

or

$$dX(t) = e^{-ct}\,dW(t) .$$

With this variable we define the function $g$,

$$g(t, X) = e^{ct} X ,$$

such that it holds for the OUP:

$$X_c(t) = g(t, X(t)) = e^{ct} X(t) .$$

With the derivatives

$$\frac{\partial g(t, X)}{\partial t} = c\,e^{ct} X , \quad \frac{\partial g(t, X)}{\partial X} = e^{ct} , \quad \frac{\partial^2 g(t, X)}{\partial X^2} = 0 ,$$

it follows from Corollary 11.2:

$$dX_c(t) = c\,e^{ct} X(t)\,dt + e^{ct}\,dX(t) + 0 = c\,X_c(t)\,dt + dW(t) ,$$

where $dX(t)$ was substituted. ■
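The equivalence of the two descriptions of the OUP can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses a simple Euler scheme for $dX_c = c X_c\,dt + dW$ with $X_c(0) = 0$ and compares it, on the same simulated increments, with the discretized representation $X_c(t) = e^{ct} \int_0^t e^{-cs}\,dW(s)$. The value $c = -1$, the grid size and the seed are hypothetical choices.

```python
import math
import random


def ou_two_ways(c=-1.0, n=10_000, t=1.0, seed=11):
    """Return X_c(t) from an Euler scheme and from the integral representation."""
    rng = random.Random(seed)
    h = t / n
    sd = math.sqrt(h)
    x_euler = 0.0   # Euler recursion for dX_c = c X_c dt + dW
    aux = 0.0       # Ito sum approximating X(t) = int_0^t exp(-c s) dW(s)
    for i in range(n):
        dw = rng.gauss(0.0, sd)
        aux += math.exp(-c * i * h) * dw      # integrand at the left endpoint s_i = i h
        x_euler += c * x_euler * h + dw
    x_repr = math.exp(c * t) * aux            # g(t, X) = exp(c t) X
    return x_euler, x_repr


x_euler, x_repr = ou_two_ways()
print(x_euler, x_repr)   # the two values differ only by a small discretization error
```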

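Further examples for practicing Corollary 11.2 can be found in the problem section.


K-Variate Diffusions

As far as content is concerned, there is no reason why Proposition 11.2 should be stated for just two processes. Let us consider as a generalization the case where $g$ depends on $K$ diffusions, all of them driven by the identical WP:

$$g: \mathbb{R}^K \to \mathbb{R} , \quad \text{i.e.} \quad g = g(X_1, \ldots, X_K) \in \mathbb{R} .$$

Then it holds with $dX_k(t)$, $k = 1, \ldots, K$, due to a second order Taylor expansion that:

$$dg(X_1, \ldots, X_K) = \sum_{k=1}^{K} \frac{\partial g}{\partial X_k}\,dX_k + \frac{1}{2} \sum_{k=1}^{K} \sum_{j=1}^{K} \frac{\partial^2 g}{\partial X_k\,\partial X_j}\,dX_k\,dX_j .$$

As in the bivariate case, one obtains $dX_k\,dX_j = \sigma_k\,\sigma_j\,dt$, cf. (11.1). Sometimes, as in Corollary 11.2, time is allowed for as a further variable,

$$g: [0, T] \times \mathbb{R}^K \to \mathbb{R} , \quad \text{i.e.} \quad g = g(t, X_1, \ldots, X_K) \in \mathbb{R} ,$$

and

$$dg(t, X_1, \ldots, X_K) = \frac{\partial g}{\partial t}\,dt + \sum_{k=1}^{K} \frac{\partial g}{\partial X_k}\,dX_k + \frac{1}{2} \sum_{k=1}^{K} \sum_{j=1}^{K} \frac{\partial^2 g}{\partial X_k\,\partial X_j}\,dX_k\,dX_j .$$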
11.4 Generalization for Independent WP

We keep to the multivariate generalization, however, allowing for several, stochastically independent Wiener processes behind the diffusions.


The  General  Case 

Now, $W_1(t), \ldots, W_d(t)$ denote stochastically independent standard Wiener processes. We allow for $d$ factors driving each of the $K$ diffusions. According to this, let $X(t)$ be a $K$-dimensional diffusion $X'(t) = (X_1(t), \ldots, X_K(t))$,¹ defined by $d$ factors $W_j(t)$, $j = 1, \ldots, d$:

$$dX_k(t) = \mu_k(t)\,dt + \sum_{j=1}^{d} \sigma_{kj}(t)\,dW_j(t) , \quad k = 1, \ldots, K.$$

In order to have a diffusion, $\mu_k$ and $\sigma_{kj}$ may only depend on $X(t)$ and $t$:

$$\mu_k(t) = \mu_k(t, X(t)) , \quad k = 1, \ldots, K ,$$

$$\sigma_{kj}(t) = \sigma_{kj}(t, X(t)) , \quad k = 1, \ldots, K , \ j = 1, \ldots, d .$$

For a function $g$, which maps $X(t)$ to the real numbers, Ito's lemma reads as follows, cf. Øksendal (2003, Theorem 4.2.1).

Proposition 11.3 (Ito's Lemma (Independent WP)) Let $g: \mathbb{R}^K \to \mathbb{R}$ be twice continuously differentiable with respect to all the arguments, and let $X_k(t)$ be diffusions on $[0, T]$ depending on $d$ independent Wiener processes:

$$dX_k(t) = \mu_k(t)\,dt + \sum_{j=1}^{d} \sigma_{kj}(t)\,dW_j(t) , \quad k = 1, \ldots, K.$$

Then it holds for $X'(t) = (X_1(t), \ldots, X_K(t))$ that

$$dg(X(t)) = \sum_{k=1}^{K} \frac{\partial g(X(t))}{\partial X_k}\,dX_k(t) + \frac{1}{2} \sum_{i=1}^{K} \sum_{k=1}^{K} \frac{\partial^2 g(X(t))}{\partial X_i\,\partial X_k}\,dX_i(t)\,dX_k(t)$$

with

$$dX_i(t)\,dX_k(t) = \sum_{j=1}^{d} \sigma_{ij}(t)\,\sigma_{kj}(t)\,dt . \tag{11.3}$$

¹ For vectors and matrices, the superscript prime denotes transposition and not differentiation. Further, the dimension $d$ of the multivariate Wiener process should not be confused with the differential operator denoted by the same symbol.

Heuristically, (11.3) can be well justified. For this purpose we consider vectors of length $d$:

$$\sigma_k(t) = \begin{pmatrix} \sigma_{k1}(t) \\ \vdots \\ \sigma_{kd}(t) \end{pmatrix} \quad \text{and} \quad W(t) = \begin{pmatrix} W_1(t) \\ \vdots \\ W_d(t) \end{pmatrix} ,$$

such that

$$dX_k(t) = \mu_k(t)\,dt + \sigma_k'(t)\,dW(t) .$$

Neglecting the dependence on time, it follows

$$dX_i\,dX_k = \mu_i\,\mu_k\,(dt)^2 + \mu_i\,\sigma_k'\,dW(t)\,dt + \mu_k\,\sigma_i'\,dW(t)\,dt + \sigma_i'\,dW(t)\,\sigma_k'\,dW(t) = \sigma_i'\,dW(t)\,dW'(t)\,\sigma_k$$

due to (see Proposition 10.8)

$$(dt)^2 = 0 \quad \text{and} \quad dW_j(t)\,dt = 0$$

and

$$\sigma_k'\,dW(t) = \left( \sigma_k'\,dW(t) \right)' = dW'(t)\,\sigma_k .$$

Let us consider the matrix

$$dW(t)\,dW'(t) = \begin{pmatrix} (dW_1(t))^2 & dW_1(t)\,dW_2(t) & \cdots & dW_1(t)\,dW_d(t) \\ dW_2(t)\,dW_1(t) & (dW_2(t))^2 & \cdots & dW_2(t)\,dW_d(t) \\ \vdots & \vdots & \ddots & \vdots \\ dW_d(t)\,dW_1(t) & dW_d(t)\,dW_2(t) & \cdots & (dW_d(t))^2 \end{pmatrix} .$$

As is well known, it holds due to (10.7) that:

$$(dW_j(t))^2 = dt .$$

Furthermore, it can be shown for stochastically independent Wiener processes that

$$dW_i(t)\,dW_k(t) = 0 , \quad i \ne k .$$

Overall, we hence obtain

$$dW(t)\,dW'(t) = I_d\,dt ,$$

with the $d$-dimensional identity matrix $I_d$. All in all it follows

$$dX_i(t)\,dX_k(t) = \sigma_i'(t)\,I_d\,dt\,\sigma_k(t) = \sigma_i'(t)\,\sigma_k(t)\,dt ,$$

which is given in (11.3).


The  2-Factor  Case 


Let us consider the case $K = d = 2$. Then, Proposition 11.3 becomes more explicit:

$$dg(X(t)) = \sum_{k=1}^{2} \frac{\partial g(X(t))}{\partial X_k}\,dX_k(t) + \frac{1}{2} \sum_{i=1}^{2} \sum_{k=1}^{2} \frac{\partial^2 g(X(t))}{\partial X_i\,\partial X_k}\,dX_i(t)\,dX_k(t)$$

with

$$dX_1\,dX_1 = (\sigma_{11}^2 + \sigma_{12}^2)\,dt , \qquad dX_2\,dX_2 = (\sigma_{21}^2 + \sigma_{22}^2)\,dt ,$$

and

$$dX_1\,dX_2 = (\sigma_{11}\,\sigma_{21} + \sigma_{12}\,\sigma_{22})\,dt .$$

Two interesting special cases result:

1. $\sigma_{12} = \sigma_{22} = 0$ (one-factor model),

2. $\sigma_{12} = \sigma_{21} = 0$ (independent diffusions).

The first case naturally corresponds to the one from the previous section: both diffusions depend only on the same WP. The second case is the opposite extreme, where each diffusion depends on only one of the two stochastically independent processes:

$$dX_k(t) = \mu_k(t)\,dt + \sigma_{kk}(t)\,dW_k(t) , \quad k = 1, 2.$$

We want to discuss both borderline cases based on two examples.

Example 11.5 (2-Factor Product Rule) Proposition 11.3 with $K = d = 2$ yields, with the derivatives from Example 11.3, the product rule:

$$d(X_1 X_2) = X_2\,dX_1 + X_1\,dX_2 + dX_1\,dX_2 . \tag{11.4}$$

In the borderline case of only one factor, naturally the result from Eq. (11.2) is reproduced. In the second borderline case of stochastically independent diffusions, however, it holds, as in the deterministic case, that

$$d(X_1 X_2) = X_2\,dX_1 + X_1\,dX_2 .$$

Without restrictions (11.4) reads as follows:

$$d(X_1 X_2) = X_2\,dX_1 + X_1\,dX_2 + (\sigma_{11}\,\sigma_{21} + \sigma_{12}\,\sigma_{22})\,dt .$$

The example illustrates that a proper application of Ito's lemma needs to account for the number of factors underlying the diffusions. ■
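The role of the number of factors can be illustrated by simulating the sum of products of increments for two diffusions driven by two independent WPs. The sketch below assumes constant coefficients $\sigma_{kj}$ and zero drift (drift terms do not contribute in the limit); the function name, coefficient values and seed are hypothetical. The sum should be close to $(\sigma_{11}\sigma_{21} + \sigma_{12}\sigma_{22})\,t$.

```python
import math
import random


def cross_variation(s11, s12, s21, s22, n=100_000, t=1.0, seed=5):
    """Sum of dX1 * dX2 for dX_k = s_k1 dW1 + s_k2 dW2 with independent W1, W2."""
    rng = random.Random(seed)
    sd = math.sqrt(t / n)
    total = 0.0
    for _ in range(n):
        dw1 = rng.gauss(0.0, sd)
        dw2 = rng.gauss(0.0, sd)
        dx1 = s11 * dw1 + s12 * dw2
        dx2 = s21 * dw1 + s22 * dw2
        total += dx1 * dx2
    return total


general = cross_variation(1.0, 0.5, 2.0, -1.0)     # theory: (1*2 + 0.5*(-1)) * t = 1.5
one_factor = cross_variation(1.0, 0.0, 2.0, 0.0)   # theory: sigma_11 * sigma_21 * t = 2
independent = cross_variation(1.0, 0.0, 0.0, 2.0)  # theory: 0
print(general, one_factor, independent)
```

In the independent case the extra term in the product rule disappears, which matches the second borderline case of the example.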


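Example 11.6 (2-Factor Quotient Rule) For $X_2(t) \ne 0$ and

$$g(X_1, X_2) = \frac{X_1}{X_2} = X_1 X_2^{-1}$$

we obtain:

$$\frac{\partial g}{\partial X_1} = X_2^{-1} , \quad \frac{\partial g}{\partial X_2} = -X_1 X_2^{-2} ,$$

$$\frac{\partial^2 g}{\partial X_1^2} = 0 , \quad \frac{\partial^2 g}{\partial X_2^2} = 2\,X_1 X_2^{-3} , \quad \frac{\partial^2 g}{\partial X_1\,\partial X_2} = -X_2^{-2} .$$

Hence Proposition 11.3 yields with $K = d = 2$, suppressing the arguments:

$$d\left( \frac{X_1}{X_2} \right) = \frac{X_2\,dX_1 - X_1\,dX_2}{X_2^2} + \frac{X_1 X_2^{-1} (\sigma_{21}^2 + \sigma_{22}^2) - (\sigma_{11}\,\sigma_{21} + \sigma_{12}\,\sigma_{22})}{X_2^2}\,dt . \tag{11.5}$$

If $X_2$ is a deterministic function ($\sigma_{21} = \sigma_{22} = 0$), then the conventional quotient rule is reproduced. ■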
11.5 Problems and Solutions

Problems

11.1 Prove part (a) from Example 9.1.
Hint: Choose $g(t, W) = tW$ in Corollary 11.2 or in (11.2).

11.2 Prove part (b) from Example 9.1.
Hint: Choose $g(t, W) = (1 - t)\,W$ in Corollary 11.2 or in (11.2).

11.3 Prove part (b) from Proposition 9.1 (integration by parts).
Hint: Choose $g(t, W) = f(t)\,W$ in Corollary 11.2 or in (11.2).

11.4 Prove statement (a) from Proposition 9.4 with Ito's lemma.
Hint: Choose $g(t, W) = e^{-ct} W$.

11.5 Prove for the OUP from Proposition 9.4:

$$\int_0^t X_c(s)\,dW(s) = \frac{1}{2} \left( X_c^2(t) - t \right) - c \int_0^t X_c^2(s)\,ds .$$

Note that for $c = 0$ (WP) this reproduces (10.3).
Hint: Choose $g(X_c(t)) = X_c^2(t)$ in Ito's lemma.

11.6 Determine the differential of $W(t)/e^{W(t)}$ according to the one-factor product rule (11.2).

11.7 Determine the differential of $W(t)/e^{W(t)}$ directly from Corollary 11.1.


Solutions 


11.1 For the proof we use Corollary 11.2 with

$$g(t, W) = tW .$$

The derivatives needed read:

$$\frac{\partial g(t, W)}{\partial t} = W , \quad \frac{\partial g(t, W)}{\partial W} = t , \quad \frac{\partial^2 g(t, W)}{\partial W^2} = 0 .$$

Hence, one determines with Corollary 11.2:

$$d(tW(t)) = W(t)\,dt + t\,dW(t) + 0 .$$

Due to $g(0, W(0)) = 0$, we obtain as an integral equation

$$tW(t) = \int_0^t W(s)\,ds + \int_0^t s\,dW(s) ,$$

which was to be shown.

11.2 As in Problem 11.1 we consider

$$g(t, W) = (1 - t)\,W$$

with

$$\frac{\partial g(t, W)}{\partial t} = -W , \quad \frac{\partial g(t, W)}{\partial W} = (1 - t) , \quad \frac{\partial^2 g(t, W)}{\partial W^2} = 0 .$$

Therefore, substitution into Corollary 11.2 yields

$$d((1 - t)\,W(t)) = -W(t)\,dt + (1 - t)\,dW(t) + 0 .$$

As $W(0) = 0$ with probability one, it follows that

$$(1 - t)\,W(t) = -\int_0^t W(s)\,ds + \int_0^t (1 - s)\,dW(s) ,$$

which was to be shown.

11.3 As an adequate function $g$ we choose

$$g(t, W) = f(t)\,W ,$$

where $f(t)$ is deterministic. Then, Corollary 11.2 is used with

$$\frac{\partial g(t, W)}{\partial t} = f'(t)\,W , \quad \frac{\partial g(t, W)}{\partial W} = f(t) , \quad \frac{\partial^2 g(t, W)}{\partial W^2} = 0 .$$

This yields for the differential:

$$dg(t, W(t)) = f'(t)\,W(t)\,dt + f(t)\,dW(t) + 0 .$$

In integral notation this reads

$$g(t, W(t)) = g(0, W(0)) + \int_0^t f'(s)\,W(s)\,ds + \int_0^t f(s)\,dW(s) .$$

As $W(0) = 0$ with probability one, we hence obtain the desired result:

$$f(t)\,W(t) = \int_0^t f'(s)\,W(s)\,ds + \int_0^t f(s)\,dW(s) .$$

11.4 If one chooses $g(t, W) = e^{-ct} W$ with

$$\frac{\partial g(t, W)}{\partial t} = -c\,e^{-ct} W , \quad \frac{\partial g(t, W)}{\partial W} = e^{-ct} , \quad \frac{\partial^2 g(t, W)}{\partial W^2} = 0 ,$$

then Corollary 11.2 allows for the following calculation:

$$d(e^{-ct} W(t)) = -c\,e^{-ct} W(t)\,dt + e^{-ct}\,dW(t) ,$$

i.e.

$$e^{-ct} W(t) = W(0) - c \int_0^t e^{-cs} W(s)\,ds + \int_0^t e^{-cs}\,dW(s)$$

or

$$W(t) = -c\,e^{ct} \int_0^t e^{-cs} W(s)\,ds + e^{ct} \int_0^t e^{-cs}\,dW(s) = -c\,e^{ct} \int_0^t e^{-cs} W(s)\,ds + X_c(t) .$$

Rearranging terms completes the proof.

11.5 With the function $g(X_c) = X_c^2$ and its derivatives,

$$g'(X_c) = 2 X_c , \quad g''(X_c) = 2 ,$$

Proposition 11.1 can be applied. We know that $X_c(t)$ is a diffusion with (see Example 11.4):

$$dX_c(t) = c\,X_c(t)\,dt + dW(t) .$$

Substituting into Proposition 11.1 shows:

$$dX_c^2(t) = 2\,X_c(t)\,dX_c(t) + \frac{2}{2}\,dt = \left( 2c\,X_c^2(t) + 1 \right) dt + 2\,X_c(t)\,dW(t) .$$

With starting value $X_c(0) = 0$ this translates into the following integral equation:

$$X_c^2(t) = \int_0^t \left( 2c\,X_c^2(s) + 1 \right) ds + 2 \int_0^t X_c(s)\,dW(s) .$$

This is equivalent to

$$\int_0^t X_c(s)\,dW(s) = \frac{1}{2} \left( X_c^2(t) - t \right) - c \int_0^t X_c^2(s)\,ds ,$$

which amounts to the claim.
11.6 We define

$$X_1(t) = W(t) , \quad X_2(t) = e^{-W(t)} ,$$

and we are interested in the differential of the product. In order to apply the one-factor product rule, we need the differentials of the factors. For $X_1$ it obviously holds that $dX_1 = dW$. For $e^{-W(t)}$ Example 11.2 yields

$$dX_2 = de^{-W} = -e^{-W}\,dW + \frac{e^{-W}}{2}\,dt .$$

Hence, we have

$$\sigma_1(t) = 1 , \quad \sigma_2(t) = -e^{-W(t)} .$$

Substituting into the product rule (11.2) yields:

$$d\left( W e^{-W} \right) = e^{-W}\,dX_1 + W\,dX_2 - e^{-W}\,dt$$
$$\qquad = e^{-W}\,dW + W \left( -e^{-W}\,dW + \frac{e^{-W}}{2}\,dt \right) - e^{-W}\,dt$$
$$\qquad = e^{-W} \left( \frac{W}{2} - 1 \right) dt + e^{-W} (1 - W)\,dW .$$


11.7 As $g(W) = W/e^{W}$ is a simple function of $W$, Corollary 11.1 yields a direct approach to the differential. For this purpose, we only need the derivatives (quotient rule):

$$g'(W) = \frac{e^{W} - W e^{W}}{e^{2W}} = \frac{1 - W}{e^{W}} ,$$

$$g''(W) = \frac{-e^{W} - (1 - W)\,e^{W}}{e^{2W}} = \frac{W - 2}{e^{W}} .$$

Thus, it follows from Ito's lemma that

$$d\left( \frac{W}{e^{W}} \right) = g'(W)\,dW + \frac{1}{2}\,g''(W)\,dt = \frac{1 - W}{e^{W}}\,dW + \frac{W - 2}{2\,e^{W}}\,dt = e^{-W} \left( \frac{W}{2} - 1 \right) dt + e^{-W} (1 - W)\,dW .$$

Of course, this result coincides with the one from the previous problem.
12.1 Summary

In the following section we discuss the most general stochastic differential equation considered here, whose solution is a diffusion. Then, linear differential equations (with variable coefficients) will be studied extensively. Here we obtain analytical solutions by Ito's lemma. We discuss special cases that are widespread in the literature on finance. In the fourth section we turn to numerical solutions, which allow us to simulate processes. The sample paths displayed in the figures of Chap. 13 are constructed that way.


12.2 Definition and Existence

After  a  definition  and  a  discussion  of  conditions  for  existence,  we  will  consider  the 
deterministic  case  as  well.  Deterministic  differential  equations  are  embedded  into 
the  stochastic  ones  as  special  cases. 


Diffusions 

We defined the solution of

$$dX(t) = \mu(t)\,dt + \sigma(t)\,dW(t)$$

as a diffusion process, where $\mu(t)$ and $\sigma(t)$ are allowed to depend on $t$ and on $X(t)$ itself. As the most general case of this chapter we consider diffusions as in (10.6):

$$dX(t) = \mu(t, X(t))\,dt + \sigma(t, X(t))\,dW(t) , \quad t \in [0, T] . \tag{12.1}$$

The solutions¹ of such differential equations can also be written in integral form:

$$X(t) = X(0) + \int_0^t \mu(s, X(s))\,ds + \int_0^t \sigma(s, X(s))\,dW(s) , \quad t \in [0, T] . \tag{12.2}$$

Under what conditions is such a definition possible? In other words: which requirements have to be met by the functions $\mu(t, x)$ and $\sigma(t, x)$ such that a solution of (12.1) exists at all, and uniquely so? We do not go deeply into this mathematical aspect here, but neither do we neglect it completely. We consider sufficient conditions that are stronger but simpler than necessary. For a profound discussion see e.g. Øksendal (2003). The first assumption requires that $\mu$ and $\sigma$ are smooth enough in the argument $x$:²

(E1) The partial derivatives of $\mu(t, x)$ and $\sigma(t, x)$ with respect to $x$ exist and are continuous in $x$.

Secondly, we maintain a linear restriction of the growth of the diffusion process:

(E2) There exist constants $K_1$ and $K_2$ with

$$|\mu(t, x)| + |\sigma(t, x)| \le K_1 + K_2\,|x| .$$

And finally we need a well defined starting value, which may be stochastic:

(E3) $X(0)$ is independent of $W(t)$ with $\mathrm{E}(X^2(0)) < \infty$.

Under these assumptions Øksendal (2003, Theorem 5.2.1) proves the following proposition.

Proposition 12.1 (Existence of a Unique Solution) Under the assumptions (E1) to (E3), Eq. (12.1) has a unique solution $X(t)$ of the form (12.2) with continuous paths and $\mathrm{E}(X^2(t)) < \infty$.

The assumption (E3) can always be met by assuming a fixed starting value. The second assumption is necessary for the existence of a (finite) solution while (E1) guarantees the uniqueness of this solution. This is to be illustrated by means of two deterministic examples.

¹ Strictly speaking, this is a so-called "strong solution" in contrast to a "weak solution". For a weak solution the behavior of $X(t)$ is only characterized in distribution. We will not concern ourselves with weak solutions.

² Normally, one demands that they satisfy a Lipschitz condition. A function $f$ is called Lipschitz continuous if there exists a constant $K$ such that for all $x$ and $y$

$$|f(x) - f(y)| \le K\,|x - y| .$$

We can conceal this condition by requiring the stronger sufficient continuous differentiability.

Example 12.1 (Violation of the Assumptions) We examine two examples known from the literature on deterministic differential equations where $\sigma(t, x) = 0$. Similar cases can be found e.g. in Øksendal (2003). In the first example we set $\mu(t, X(t)) = X^{2/3}(t)$:

$$dX(t) = X^{2/3}(t)\,dt , \quad X(0) = 0 , \quad t \ge 0 .$$

We define for an arbitrary $a > 0$ infinitely many solutions:

$$X_a(t) = \begin{cases} 0 , & t \le a \\ \dfrac{(t - a)^3}{27} , & t > a . \end{cases}$$

By differentiating one can observe that any $X_a(t)$ indeed satisfies the given equation. The reason for the ambiguity of the solutions lies in the violation of (E1) as the partial derivative,

$$\frac{\partial \mu(t, x)}{\partial x} = \frac{2}{3}\,x^{-1/3} ,$$

does not exist at $x = 0$.

The second example reads for $\mu(t, X(t)) = X^2(t)$:

$$dX(t) = X^2(t)\,dt , \quad X(0) = 1 , \quad t \in [0, 1) .$$

Again, by elementary means one proves that the solution reads

$$X(t) = (1 - t)^{-1} , \quad 0 \le t < 1 ,$$

and hence tends to $\infty$ for $t \to 1$. The reason for this lies in a violation of (E2): the quadratic function $\mu(t, x) = x^2$ cannot be linearly bounded. ■
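Both violations can be checked with a few lines of code. The sketch below assumes the family $X_a(t) = 0$ for $t \le a$ and $X_a(t) = (t - a)^3/27$ for $t > a$ as candidate solutions of the first equation (one family that satisfies $dX/dt = X^{2/3}$ with $X(0) = 0$), with the derivative computed by hand; all function names are illustrative.

```python
def x_a(t, a):
    """Candidate solution: zero up to time a, then (t - a)^3 / 27."""
    return 0.0 if t <= a else (t - a) ** 3 / 27.0


def x_a_dot(t, a):
    """Time derivative of x_a computed by hand: (t - a)^2 / 9 for t > a."""
    return 0.0 if t <= a else (t - a) ** 2 / 9.0


def residual(t, a):
    """Left-hand side minus right-hand side of dx/dt = x^(2/3)."""
    return x_a_dot(t, a) - x_a(t, a) ** (2.0 / 3.0)


def blow_up(t):
    """Solution x(t) = 1 / (1 - t) of dx/dt = x^2 with x(0) = 1, valid for t < 1."""
    return 1.0 / (1.0 - t)


# Every choice of a gives a solution starting at zero: uniqueness fails.
for a in (0.5, 1.0, 2.0):
    print(a, [round(residual(t, a), 12) for t in (0.0, 1.0, 3.0, 10.0)])

# The second solution grows without bound as t approaches 1: no finite solution on [0, 1].
print([blow_up(t) for t in (0.0, 0.9, 0.99, 0.999)])
```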


Linear  Coefficients 

In order to be able to state analytical solutions, we frequently restrict generality and consider linear differential equations:

$$dX(t) = \left( c_1(t)\,X(t) + c_2(t) \right) dt + \left( \sigma_1(t)\,X(t) + \sigma_2(t) \right) dW(t) , \quad t \ge 0 , \tag{12.3}$$

where the variable coefficients $c_i(t)$ and $\sigma_i(t)$, $i = 1, 2$, are continuous deterministic functions of time. Here, $X(t)$ enters $\mu$ and $\sigma$ just linearly. Obviously, the partial derivatives from (E1) are constant (in $x$) and thus continuous. In addition, one obtains a linear bound:

$$|\mu(t, x)| + |\sigma(t, x)| \le |c_1(t)|\,|x| + |c_2(t)| + |\sigma_1(t)|\,|x| + |\sigma_2(t)|$$
$$\qquad = \left( |c_1(t)| + |\sigma_1(t)| \right) |x| + \left( |c_2(t)| + |\sigma_2(t)| \right)$$
$$\qquad \le K_2\,|x| + K_1 .$$

As $c_i(t)$ and $\sigma_i(t)$ are continuous in $t$ and hence bounded for finite $t$, positive constants $K_1$ and $K_2$ can be specified such that the inequality above holds true. Hence, (E2) is satisfied, and a unique solution exists for linear stochastic differential equations. What is more, Ito's lemma will also allow us to specify an explicit form of this analytical solution, from which one can determine first and second moments as functions of time. The next section is reserved for studying equation (12.3). Before that, we consider the borderline case of a deterministic linear differential equation.


Deterministic  Case 

By setting $\sigma_1(t) = \sigma_2(t) = 0$ in (12.3), we obtain a deterministic linear differential equation (in small letters to distinguish from the stochastic case),

$$dx(t) = \left( c_1(t)\,x(t) + c_2(t) \right) dt , \quad t \ge 0 , \tag{12.4}$$

or as well

$$\dot{x}(t) = c_1(t)\,x(t) + c_2(t) .$$

Frequently, one speaks of first order differential equations, as only the first derivative is involved. As is well known, the solution reads (see Problem 12.1)

$$x(t) = z(t) \left[ x(0) + \int_0^t \frac{c_2(s)}{z(s)}\,ds \right] \tag{12.5}$$

with

$$z(t) = \exp \left\{ \int_0^t c_1(s)\,ds \right\} . \tag{12.6}$$

For $c_2(t) = 0$ one obtains from (12.4) the related homogeneous differential equation (with starting value 1),

$$dz(t) = c_1(t)\,z(t)\,dt , \quad z(0) = 1 ,$$

which just has $z(t)$ from (12.6) as a solution. The following example presents the special case of constant coefficients.

Example  12.2  ( Constant  Coefficients)  In  the  case  of  constant  coefficients, 

C\(t)  —  c\  —  const ,  C2 (t)  =  C2  =  const , 


the  solution  from  (12.5)  simplifies,  see  Problem  12.1 


x(t)  —  e 


_  C\t 


X 


C2 


(0)  +  —  (l  —  e~ci‘) 


c  1 


=  eC1' 


v(0)  +  — 

c  1 


£2 

Cl 


Hence, for negative values of $c_1$ the equation is stable in the sense that
the solution tends towards a fixed value:

$$x(t) \to -\frac{c_2}{c_1} =: \mu \quad \text{as } t \to \infty, \quad c_1 < 0.$$

Basically, one can already observe this from the equation itself:

$$dx(t) = \left(c_1\,x(t) + c_2\right) dt = c_1\left(x(t) - \mu\right) dt.$$

Namely, if $x(t)$ lies above the limit $\mu$, then the expression in brackets is positive and
hence, because $c_1 < 0$, the change is negative such that $x(t)$ adjusts towards the limit $\mu$. Conversely,
$x(t) < \mu$ causes a positive derivative such that $x(t)$ grows and moves towards the
limit. All in all, for $c_1 < 0$ a convergence to $\mu$ is modeled. ■
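The stability statement of Example 12.2 can be checked numerically. The following is only a minimal sketch: the parameter values and the function names `x_closed_form` and `x_euler` are assumptions chosen for illustration, and the crude Euler integration serves merely as a cross-check of the closed-form solution.

```python
import math

def x_closed_form(t, x0, c1, c2):
    """Closed-form solution of dx = (c1*x + c2) dt for constant c1 != 0."""
    return math.exp(c1 * t) * (x0 + c2 / c1) - c2 / c1

def x_euler(t, x0, c1, c2, steps=100_000):
    """Crude Euler integration of the same equation (cross-check only)."""
    dt = t / steps
    x = x0
    for _ in range(steps):
        x += (c1 * x + c2) * dt
    return x

# Hypothetical values with c1 < 0, so that x(t) should approach mu = -c2/c1.
c1, c2, x0 = -0.5, 1.0, 5.0
mu = -c2 / c1

for t in (1.0, 5.0, 20.0):
    print(t, x_closed_form(t, x0, c1, c2), x_euler(t, x0, c1, c2), mu)
```

For increasing t both columns should move towards the limit mu, in line with the argument given in the example.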


In the following, we will see that the solution of the deterministic linear equation
is embedded into the stochastic one for $\sigma_1(t) = \sigma_2(t) = 0$.


12.3 Linear Stochastic Differential Equations

For  the  solution  of  the  equation  (12.3)  we  expect  a  similar  structure  as  in  the 
deterministic  case,  (12.5),  i.e.  a  homogeneous  solution  as  a  multiplicative  factor 
has  to  be  expected.  Hence,  we  start  with  the  solution  of  a  homogeneous  stochastic 
equation. 
Homogeneous Solution

For $c_2(t) = \sigma_2(t) = 0$ one obtains from (12.3) the corresponding homogeneous
linear equation. In doing so, we rename X and choose 1 as the starting value (cf. footnote 3):

$$dZ(t) = c_1(t)\,Z(t)\,dt + \sigma_1(t)\,Z(t)\,dW(t), \quad Z(0) = 1. \qquad (12.7)$$

Now, Ito's lemma (Proposition 11.1) is applied to $g(Z(t)) = \log(Z(t))$. Thus, we
obtain as the solution of (12.7),

$$Z(t) = \exp\left\{\int_0^t \left(c_1(s) - \frac{1}{2}\sigma_1^2(s)\right) ds + \int_0^t \sigma_1(s)\,dW(s)\right\}, \qquad (12.8)$$

see Problem 12.2. Hence, for $\sigma_1(t) = 0$ the deterministic solution from (12.6)
is reproduced. The solution with an arbitrary starting value different from zero
therefore reads

$$X(t) = X(0)\,\exp\left\{\int_0^t \left(c_1(s) - \frac{1}{2}\sigma_1^2(s)\right) ds + \int_0^t \sigma_1(s)\,dW(s)\right\}.$$


General  Solution 


Let us return to the solution of equation (12.3). Analogously to the deterministic
case (12.5), we take $Z(t)$ from (12.8) as the homogeneous solution. At the
end of the section we will establish the following proposition by applying two
versions of Ito's lemma. Two interesting alternative proofs are given as exercise
problems.


Proposition 12.2 (Solution of Linear SDE with Variable Coefficients) The solution
of (12.3) with deterministic coefficients that are continuous in t is

$$X(t) = Z(t)\left[X(0) + \int_0^t \frac{c_2(s) - \sigma_1(s)\,\sigma_2(s)}{Z(s)}\,ds + \int_0^t \frac{\sigma_2(s)}{Z(s)}\,dW(s)\right] \qquad (12.9)$$

with the homogeneous solution

$$Z(t) = \exp\left\{\int_0^t \left(c_1(s) - \frac{1}{2}\sigma_1^2(s)\right) ds + \int_0^t \sigma_1(s)\,dW(s)\right\}.$$


3 The renaming justifies the assumption regarding the starting value. Consider

$$dX(t) = c_1(t)\,X(t)\,dt + \sigma_1(t)\,X(t)\,dW(t), \quad X(0) \neq 0,$$

with a starting value different from zero, then by division one can normalize $Z(t) = X(t)/X(0)$.

For $\sigma_1(t) = \sigma_2(t) = 0$ we again obtain the known result of a deterministic
differential equation, cf. (12.5).


Expected  Value  and  Variance 

The process defined by (12.3) reads in integral notation

$$X(t) = X(0) + \int_0^t \left(c_1(s)X(s) + c_2(s)\right) ds + \int_0^t \left(\sigma_1(s)X(s) + \sigma_2(s)\right) dW(s).$$

Let us define the expectation function as

$$\mu_1(t) := \mathrm{E}(X(t)),$$

then it holds due to Propositions 8.2 (Fubini) and 10.3 that:

$$\mu_1(t) = \mathrm{E}(X(0)) + \int_0^t \left(c_1(s)\,\mathrm{E}(X(s)) + c_2(s)\right) ds + 0 = \mu_1(0) + \int_0^t \left(c_1(s)\,\mu_1(s) + c_2(s)\right) ds.$$

This corresponds exactly to the deterministic equation (12.4). Hence, the solution
is known from (12.5) and one obtains the form given in Proposition 12.3. The
derivation of an expression for the second moment,

$$\mu_2(t) := \mathrm{E}\left(X^2(t)\right),$$

is somewhat more complex, see Problem 12.3.

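Proposition 12.3 (Moments of the Solution of a Linear SDE) Under the
assumptions of Proposition 12.2 it holds that

$$\mu_1(t) = z(t)\left[\mu_1(0) + \int_0^t \frac{c_2(s)}{z(s)}\,ds\right], \quad z(t) = \exp\left\{\int_0^t c_1(s)\,ds\right\}, \qquad (12.10)$$

and

$$\mu_2(t) = \zeta(t)\left[\mu_2(0) + \int_0^t \frac{\gamma_2(s)}{\zeta(s)}\,ds\right], \quad \zeta(t) = \exp\left\{\int_0^t \gamma_1(s)\,ds\right\}, \qquad (12.11)$$

where

$$\gamma_1(t) = 2c_1(t) + \sigma_1^2(t), \quad \gamma_2(t) = 2\left[c_2(t) + \sigma_1(t)\,\sigma_2(t)\right]\mu_1(t) + \sigma_2^2(t).$$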
Example 12.3 (Homogeneous Linear SDE (Constant Coefficients)) Since the
work of Black and Scholes (1973) and Merton (1973), the stock
price X(t) has commonly been modeled by a homogeneous linear SDE with constant coefficients (and
starting value X(0), cf. (1.3)):

$$dX(t) = c_1\,X(t)\,dt + \sigma_1\,X(t)\,dW(t).$$

The solution resulting from (12.9), i.e. from Proposition 12.2, is a geometric
Brownian motion,

$$X(t) = X(0)\,\exp\left\{\left(c_1 - \frac{1}{2}\sigma_1^2\right) t + \sigma_1 W(t)\right\}.$$

This process has already been discussed in Chap. 7. With the generally derived
formulas we can now recheck the moment functions from (7.9). Proposition 12.3
yields (see Problem 12.4)

$$\mu_1(t) = \mu_1(0)\,\exp(c_1 t),$$
$$\mu_2(t) = \mu_2(0)\,\exp\left\{(2c_1 + \sigma_1^2)\,t\right\}.$$

Now, assume a fixed starting value X(0). Then, it holds that

$$\mu_1(0) = X(0) \quad \text{and} \quad \mu_2(0) = X^2(0),$$

and hence

$$\mathrm{Var}(X(t)) = \mu_2(t) - \mu_1^2(t) = X^2(0)\,\exp(2c_1 t)\left(\exp(\sigma_1^2 t) - 1\right).$$

With $X(0) = 1$, $\mu = c_1 - \frac{1}{2}\sigma_1^2$ and $\sigma = \sigma_1$ this corresponds to the notation from
Chap. 7. The moments from (7.9) are indeed reproduced. ■
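The moment formulas of Example 12.3 can be compared with a Monte Carlo estimate. This is a hedged sketch only: the parameter values, the seed, and the helper names `gbm_moments` and `gbm_sample` are assumptions made for illustration; the draws use the closed-form solution with $W(t) \sim N(0, t)$.

```python
import math
import random

def gbm_moments(t, x0, c1, sigma1):
    """Mean and variance of X(t) for a fixed starting value x0 (Example 12.3)."""
    mean = x0 * math.exp(c1 * t)
    var = x0 ** 2 * math.exp(2 * c1 * t) * (math.exp(sigma1 ** 2 * t) - 1.0)
    return mean, var

def gbm_sample(t, x0, c1, sigma1, rng):
    """One draw of X(t) = x0 * exp((c1 - sigma1^2/2) t + sigma1 W(t))."""
    w_t = rng.gauss(0.0, math.sqrt(t))  # W(t) ~ N(0, t)
    return x0 * math.exp((c1 - 0.5 * sigma1 ** 2) * t + sigma1 * w_t)

# Hypothetical parameter values, chosen only for illustration.
t, x0, c1, sigma1 = 1.0, 1.0, 0.05, 0.2
rng = random.Random(12345)
draws = [gbm_sample(t, x0, c1, sigma1, rng) for _ in range(200_000)]

mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (len(draws) - 1)
mean, var = gbm_moments(t, x0, c1, sigma1)
print(mc_mean, mean)
print(mc_var, var)
```

With this many draws the simulated mean and variance should agree with the formulas up to sampling error.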


Inhomogeneous Linear SDE with Additive Noise

For $c_2(t) \neq 0$ the linear SDE is inhomogeneous. However, at the same time the
increments of the Wiener process ("noise") enter into (12.3) additively, i.e. $\sigma_1(t) = 0$:

$$dX(t) = \left(c_1(t)\,X(t) + c_2(t)\right) dt + \sigma_2(t)\,dW(t). \qquad (12.12)$$
The solution results from (12.9) in Proposition 12.2 as

$$X(t) = z(t)\left[X(0) + \int_0^t \frac{c_2(s)}{z(s)}\,ds + \int_0^t \frac{\sigma_2(s)}{z(s)}\,dW(s)\right], \qquad (12.13)$$

where $z(t)$ is a deterministic function:

$$z(t) = \exp\left\{\int_0^t c_1(s)\,ds\right\}.$$

Note that X(t), as a Stieltjes integral, is a Gaussian process due to Proposition 9.2. Its
moments result correspondingly (for a fixed starting value X(0)). We collect these
results in a corollary.


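Corollary 12.1 (Additive Noise) The solution of (12.12) with deterministic coefficients
that are continuous in t is given by (12.13). Let the starting value X(0) be deterministic.
Then, the process is Gaussian with:

$$\mu_1(t) = z(t)\left[X(0) + \int_0^t \frac{c_2(s)}{z(s)}\,ds\right], \quad z(t) = \exp\left\{\int_0^t c_1(s)\,ds\right\}, \qquad (12.14)$$

$$\mathrm{Var}(X(t)) = z^2(t)\int_0^t \left(\frac{\sigma_2(s)}{z(s)}\right)^2 ds. \qquad (12.15)$$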


We  illustrate  the  corollary  with  the  following  example. 


Example 12.4 (Convergence to Zero) As a concrete example, let us consider the
process given by the following equation with starting value 0:

$$dX(t) = -X(t)\,dt + \frac{dW(t)}{\sqrt{1+t}}, \quad t \geq 0, \quad X(0) = 0.$$

This equation is a special case of additive noise as it holds that $\sigma_1(t) = 0$. The
remaining coefficients read:

$$c_1(t) = -1, \quad c_2(t) = 0, \quad \sigma_2(t) = \frac{1}{\sqrt{1+t}}.$$

What behavior is to be expected intuitively for X(t)? The volatility term, $\sigma_2(t)$, tends
to zero as t grows; does this also hold true for the variance of the process? And
$c_1(t) = -1$ implies that positive values influence the change negatively and vice
versa; does the process hence fluctuate around the expectation of zero? In fact, we
can show that the process with vanishing variance varies around zero and therefore
converges to zero. For this we need the first two moments (cf. footnote 4). These can be obtained
from (12.14) and (12.15):

$$\mathrm{E}(X(t)) = 0, \quad \mathrm{Var}(X(t)) = e^{-2t}\int_0^t \frac{e^{2s}}{1+s}\,ds.$$

What can be learned from this about the variance as t increases? In Problem 12.7
we show

$$\int_0^t \frac{e^{2s}}{1+s}\,ds < \frac{e^{2t}}{1+t} - 1.$$

This proves $\mathrm{Var}(X(t)) \to 0$ for $t \to \infty$. Hence, X(t) indeed
tends to zero in mean square.
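The variance from Example 12.4 and the bound implied by Problem 12.7 can be evaluated numerically. The sketch below is an illustration under stated assumptions: the midpoint rule, the step count, and the function names `variance_example` and `upper_bound` are choices made here and are not part of the text. The integrand is written as $e^{-2(t-s)}/(1+s)$ to avoid large exponentials.

```python
import math

def variance_example(t, steps=20_000):
    """Var(X(t)) = exp(-2t) * int_0^t exp(2s)/(1+s) ds, via the midpoint rule.

    The factor exp(-2t) is pulled into the integrand for numerical stability.
    """
    h = t / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        total += math.exp(-2.0 * (t - s)) / (1.0 + s)
    return total * h

def upper_bound(t):
    """Bound on Var(X(t)) implied by the inequality of Problem 12.7."""
    return 1.0 / (1.0 + t) - math.exp(-2.0 * t)

for t in (1.0, 5.0, 20.0, 100.0):
    print(t, variance_example(t), upper_bound(t))
```

Both columns shrink as t grows, with the variance staying below the bound, which illustrates the convergence in mean square.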


Proof of Proposition 12.2

With the homogeneous solution

$$Z(t) := \exp\left\{\int_0^t \left(c_1(s) - \frac{1}{2}\sigma_1^2(s)\right) ds + \int_0^t \sigma_1(s)\,dW(s)\right\}$$

of

$$dZ(t) = c_1(t)\,Z(t)\,dt + \sigma_1(t)\,Z(t)\,dW(t)$$

we define the two auxiliary quantities

$$X_1(t) := Z^{-1}(t), \quad X_2(t) := X(t).$$

Note that X(t) is the process defined by (12.3) such that the differential of $X_2(t)$ is
given by (12.3). The proof suggested here uses the product rule for $d(X_1(t)\,X_2(t))$.
However, for a valid application the derivation of $dX_1(t)$ is necessary as well.

As a first step we use Ito's lemma in the form of Proposition 11.1 in order to
determine the differential for $X_1(t)$ with

$$g(Z) = Z^{-1}, \quad g'(Z) = -Z^{-2}, \quad g''(Z) = 2Z^{-3}.$$

4 For this purpose we do not need an explicit expression for the process which, however, can be
easily obtained from (12.13) with X(0) = 0:

$$X(t) = e^{-t}\int_0^t \frac{e^{s}}{\sqrt{1+s}}\,dW(s).$$
The differential becomes

$$dX_1(t) = g'(Z(t))\,dZ(t) + \frac{1}{2}\,g''(Z(t))\,\sigma_1^2(t)\,Z^2(t)\,dt$$
$$= -\frac{c_1(t)\,Z(t)\,dt + \sigma_1(t)\,Z(t)\,dW(t)}{Z^2(t)} + \frac{1}{2}\,\frac{2\,\sigma_1^2(t)\,Z^2(t)}{Z^3(t)}\,dt$$
$$= \frac{\sigma_1^2(t) - c_1(t)}{Z(t)}\,dt - \frac{\sigma_1(t)}{Z(t)}\,dW(t)$$
$$= \left(\sigma_1^2(t) - c_1(t)\right) X_1(t)\,dt - \sigma_1(t)\,X_1(t)\,dW(t).$$

In a second step, we can now apply the stochastic product rule (see Eq. (11.2)) as an
implication of Proposition 11.2 to the auxiliary quantities (cf. footnote 5):

$$d\left(X_1(t)\,X_2(t)\right) = X_1(t)\,dX_2(t) + X_2(t)\,dX_1(t) - \left(\sigma_1(t)\,X_2(t) + \sigma_2(t)\right)\sigma_1(t)\,X_1(t)\,dt.$$

If the differentials $dX_1(t)$ and $dX_2(t)$ are plugged in, then some terms cancel each
other such that only the following remains:

$$d\left(X_1(t)\,X_2(t)\right) = X_1(t)\left(c_2(t)\,dt + \sigma_2(t)\,dW(t)\right) - \sigma_1(t)\,\sigma_2(t)\,X_1(t)\,dt$$
$$= \frac{c_2(t) - \sigma_1(t)\,\sigma_2(t)}{Z(t)}\,dt + \frac{\sigma_2(t)}{Z(t)}\,dW(t).$$

Due to

$$X_1(t)\,X_2(t) = \frac{X(t)}{Z(t)},$$

it follows by integrating in a third step:

$$\frac{X(t)}{Z(t)} = \frac{X(0)}{Z(0)} + \int_0^t \frac{c_2(s) - \sigma_1(s)\,\sigma_2(s)}{Z(s)}\,ds + \int_0^t \frac{\sigma_2(s)}{Z(s)}\,dW(s).$$

As Z(0) = 1, we have established (12.9) and hence completed the proof. Two
alternative proofs, which are again based on Ito's lemma (or implications thereof),
are covered as exercise problems.


5 There is the risk of confusing the symbols $\sigma_i$, i = 1, 2, from Eq. (12.3) with the ones from
Eq. (11.1). Note that the volatility of $X_1$ (i.e. "$\sigma_1$") is given by $-\sigma_1 X_1$ while the volatility term
"$\sigma_2$" of $X_2$ just reads $\sigma_1 X_2 + \sigma_2$!

12.4 Numerical Solutions

Even  if  an  analytical  expression  for  the  solution  of  a  SDE  is  known,  numerical 
solutions  in  the  sense  of  simulated  approximations  to  paths  of  a  process  are  of 
interest.  Such  a  simulation  of  a  solution  is,  on  the  one  hand,  desired  for  reasons 
of  a  graphic  illustration;  on  the  other  hand,  in  practice  a  whole  family  of  numerical 
solutions  is  simulated  in  order  to  obtain  a  whole  scenario  of  possible  trajectories. 

Euler  Approximation 

The interval [0, T] from (12.1) is divided w.l.o.g. into n equidistant intervals of
length T/n. The corresponding partition reads:

$$0 = t_0 < t_1 = \frac{T}{n} < \ldots < t_i = \frac{iT}{n} < \ldots < t_n = T.$$

The theoretical solution from (12.2) of an arbitrary diffusion is now considered on
the subinterval $[t_{i-1}, t_i]$, $i = 1, \ldots, n$:

$$X(t_i) = X(t_{i-1}) + \int_{t_{i-1}}^{t_i} \mu(s, X(s))\,ds + \int_{t_{i-1}}^{t_i} \sigma(s, X(s))\,dW(s).$$

This allows for the following approximation (cf. footnote 6), as discussed e.g. in Mikosch
(1998):

$$X(t_i) \approx X(t_{i-1}) + \mu(t_{i-1}, X(t_{i-1}))\int_{t_{i-1}}^{t_i} ds + \sigma(t_{i-1}, X(t_{i-1}))\int_{t_{i-1}}^{t_i} dW(s),$$

which can also be written as:

$$X(t_i) \approx X(t_{i-1}) + \mu(t_{i-1}, X(t_{i-1}))\,\frac{T}{n} + \sigma(t_{i-1}, X(t_{i-1}))\left(W(t_i) - W(t_{i-1})\right).$$

For this purpose

$$\int_{t_{i-1}}^{t_i} ds = t_i - t_{i-1} = \frac{T}{n}$$

was used. Hence, we have a recursive scheme. Given $X_0 = X(0)$ one calculates for
i = 1:

$$X_1 = X_0 + \mu(0, X_0)\,\frac{T}{n} + \sigma(0, X_0)\left(W(t_1) - W(0)\right),$$

and in general, for $i = 1, \ldots, n$:

$$X_i = X_{i-1} + \mu(t_{i-1}, X_{i-1})\,\frac{T}{n} + \sigma(t_{i-1}, X_{i-1})\left(W(t_i) - W(t_{i-1})\right). \qquad (12.16)$$

6 In the literature, one speaks of an Euler approximation. An improvement is known under the
keyword Milstein approximation. In order to explain what is meant by "improve" in this case, one
would have to become more involved in numerics.

n 

Thus we obtain n observations $X_i$ (i.e. n + 1 observations including the starting
value), with which a path of the continuous-time process X(t) on [0, T] is simulated.
However, this simulation requires Gaussian pseudo random numbers in (12.16),

$$W(t_i) - W(t_{i-1}) = W\left(\frac{iT}{n}\right) - W\left(\frac{(i-1)T}{n}\right) \sim N\left(0, \frac{T}{n}\right).$$

For this purpose, a series of stochastically independent $N(0, T/n)$-distributed random
variables $\varepsilon_i$ needs to be simulated instead of $W(t_i) - W(t_{i-1})$, in order to obtain a
numerical solution $X_i$, $i = 1, \ldots, n$, for the diffusion X(t) from (12.2) according
to (12.16). Naturally, as n grows, the approximation by the numerical solution
improves.
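The recursion (12.16) translates almost literally into code. The following is a minimal sketch rather than a definitive implementation: the function name `euler_path`, the use of Python's `random.Random.gauss` for the increments $\varepsilon_i$, and the parameter values of the illustration are assumptions made here.

```python
import math
import random

def euler_path(mu, sigma, x0, T, n, rng):
    """Return [X_0, X_1, ..., X_n] computed with the Euler recursion (12.16).

    mu and sigma are callables (t, x) -> float for drift and volatility.
    """
    dt = T / n
    path = [x0]
    for i in range(1, n + 1):
        t_prev = (i - 1) * dt
        x_prev = path[-1]
        # eps stands in for the increment W(t_i) - W(t_{i-1}) ~ N(0, T/n)
        eps = rng.gauss(0.0, math.sqrt(dt))
        path.append(x_prev + mu(t_prev, x_prev) * dt + sigma(t_prev, x_prev) * eps)
    return path

# Illustration with hypothetical values: dX = c1 X dt + sigma1 X dW on [0, 1].
c1, sigma1 = 0.05, 0.2
path = euler_path(lambda t, x: c1 * x, lambda t, x: sigma1 * x,
                  x0=1.0, T=1.0, n=1000, rng=random.Random(0))
print(len(path), path[0], path[-1])
```

Passing drift and volatility as functions keeps the scheme applicable to any diffusion of the form (12.2); a different seed produces a different simulated trajectory.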

12.5 Problems and Solutions

Problems 

12.1 Show that the function given in (12.5) solves the deterministic differential
equation (12.4). What does it look like in the case of constant coefficients?

12.2 Show that Z(t) from (12.8) solves the homogeneous SDE (12.7) with Z(0) = 1.

Hint: See the text.

12.3 Prove (12.11) from Proposition 12.3.

Hint: Determine for $g(X(t)) = X^2(t)$ an expression with Ito's lemma.

12.4 Derive the expectation and the variance of the geometric Brownian motion
with Proposition 12.3,

$$X(t) = X(0)\,\exp\left\{\left(c_1 - \frac{1}{2}\sigma_1^2\right) t + \sigma_1 W(t)\right\}.$$

12.5 Determine the process X(t) for which it holds that:

$$dX(t) = X(t)\,dW(t), \quad X(0) = 1.$$

Hint: Proposition 12.2.

12.6 Find the solution of

$$dX(t) = -\frac{X(t)}{1+t}\,dt + \frac{dW(t)}{1+t}, \quad t \geq 0,$$

for X(0) = 0. Show that it tends to zero in mean square.
Hint: Proposition 12.2.

12.7 Show for Example 12.4:

$$\int_0^t \frac{e^{2s}}{1+s}\,ds < \frac{e^{2t}}{1+t} - 1.$$

12.8 Determine the solution of

$$dX(t) = -\frac{X(t)}{1-t}\,dt + dW(t), \quad 0 \leq t < 1,$$

with X(0) = 0. Show that $\mathrm{Var}(X(t)) = (1-t)\,t$ and hence that X(t) tends to zero in
mean square for $t \to 1$. (This reminds us of the Brownian bridge, see (7.6). In fact,
the above SDE defines a Brownian bridge, cf. Grimmett & Stirzaker, 2001, p. 535.)

12.9 Prove Proposition 12.2 by directly applying Proposition 11.2.

Hint: Choose $g(X, Z) = X/Z$.

12.10 Prove Proposition 12.2 with the quotient rule from (11.5).

Hint: First derive the quotient rule for the one-factor case (d = 1) as a special case
of (11.5).


Solutions 

12.1 The solution from (12.5) reads

$$x(t) = z(t)\left[x(0) + \int_0^t \frac{c_2(s)}{z(s)}\,ds\right]$$

with

$$z(t) = \exp\left\{\int_0^t c_1(s)\,ds\right\}.$$

Let us define the square bracket as b(t):

$$b(t) = x(0) + \int_0^t \frac{c_2(s)}{z(s)}\,ds = \frac{x(t)}{z(t)} \quad \text{and} \quad b'(t) = \frac{c_2(t)}{z(t)}.$$

The derivative of z(t) is

$$z'(t) = c_1(t)\,\exp\left\{\int_0^t c_1(s)\,ds\right\} = c_1(t)\,z(t).$$

Hence, the product rule yields:

$$x'(t) = z'(t)\,b(t) + z(t)\,b'(t) = c_1(t)\,z(t)\,\frac{x(t)}{z(t)} + z(t)\,\frac{c_2(t)}{z(t)} = c_1(t)\,x(t) + c_2(t),$$

which just corresponds to the claim.

In the case of constant coefficients, x(t) from (12.5) with $z(t) = e^{c_1 t}$ becomes:

$$x(t) = e^{c_1 t}\left[x(0) + c_2\int_0^t e^{-c_1 s}\,ds\right] = e^{c_1 t}\left[x(0) - \frac{c_2}{c_1}\left(e^{-c_1 t} - 1\right)\right] = e^{c_1 t}\left(x(0) + \frac{c_2}{c_1}\right) - \frac{c_2}{c_1}.$$

12.2 For Z(t) from (12.7) it holds that

$$dZ(t) = \mu(t, Z(t))\,dt + \sigma(t, Z(t))\,dW(t)$$

with

$$\mu(t, Z(t)) = c_1(t)\,Z(t), \quad \sigma(t, Z(t)) = \sigma_1(t)\,Z(t).$$

Therefore, Proposition 11.1 yields:

$$dg(Z(t)) = g'(Z(t))\,dZ(t) + \frac{1}{2}\,g''(Z(t))\,\sigma^2(t, Z(t))\,dt.$$

With

$$g(x) = \log(x), \quad g'(x) = \frac{1}{x}, \quad g''(x) = -\frac{1}{x^2},$$

we hence obtain

$$d\log(Z(t)) = \frac{\mu(t, Z(t))\,dt + \sigma(t, Z(t))\,dW(t)}{Z(t)} - \frac{1}{2}\,\frac{\sigma^2(t, Z(t))}{Z^2(t)}\,dt = c_1(t)\,dt + \sigma_1(t)\,dW(t) - \frac{\sigma_1^2(t)}{2}\,dt.$$

Integration yields

$$\log(Z(t)) = \log(Z(0)) + \int_0^t \left(c_1(s) - \frac{\sigma_1^2(s)}{2}\right) ds + \int_0^t \sigma_1(s)\,dW(s).$$

Because of Z(0) = 1, the exponential function yields as desired:

$$Z(t) = \exp\left\{\int_0^t \left(c_1(s) - \frac{\sigma_1^2(s)}{2}\right) ds + \int_0^t \sigma_1(s)\,dW(s)\right\}.$$


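12.3 Proposition 11.1 with

$$g(x) = x^2, \quad g'(x) = 2x, \quad g''(x) = 2$$

is applied to $X^2$ where the differential dX(t) is given by Eq. (12.3). This leads to

$$dX^2(t) = 2X(t)\,dX(t) + \left(\sigma_1(t)X(t) + \sigma_2(t)\right)^2 dt$$
$$= \left[2X(t)\left(c_1(t)X(t) + c_2(t)\right) + \left(\sigma_1(t)X(t) + \sigma_2(t)\right)^2\right] dt + 2X(t)\left(\sigma_1(t)X(t) + \sigma_2(t)\right) dW(t).$$

As an integral equation this reads as follows:

$$X^2(t) = X^2(0) + \int_0^t \left[2X(s)\left(c_1(s)X(s) + c_2(s)\right) + \left(\sigma_1(s)X(s) + \sigma_2(s)\right)^2\right] ds + 2\int_0^t X(s)\left(\sigma_1(s)X(s) + \sigma_2(s)\right) dW(s).$$

The expectation of the second integral is zero due to Proposition 10.3. The
expectation of the first integral results due to Fubini's theorem as

$$\int_0^t \left[2c_1(s)\,\mathrm{E}(X^2(s)) + 2c_2(s)\,\mathrm{E}(X(s)) + \sigma_1^2(s)\,\mathrm{E}(X^2(s)) + 2\sigma_1(s)\sigma_2(s)\,\mathrm{E}(X(s)) + \sigma_2^2(s)\right] ds.$$

With the definition of $\mu_1(s)$ and $\mu_2(s)$ it thus follows that

$$\mu_2(t) = \mu_2(0) + \int_0^t \left[\left(2c_1(s) + \sigma_1^2(s)\right)\mu_2(s) + 2\left(c_2(s) + \sigma_1(s)\sigma_2(s)\right)\mu_1(s) + \sigma_2^2(s)\right] ds.$$

In differential notation this equation reads

$$d\mu_2(t) = \left(\gamma_1(t)\,\mu_2(t) + \gamma_2(t)\right) dt,$$

where the functions $\gamma_1(t)$ and $\gamma_2(t)$ were defined in the proposition following (12.11).
Therefore, the second moment results as the solution of a
deterministic differential equation of the form (12.4). Its solution can be found
in (12.5). Hence, the proposition is verified.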

12.4 The geometric Brownian motion solves the homogeneous linear equation with
constant coefficients:

$$c_2(t) = \sigma_2(t) = 0, \quad c_1(t) = c_1 = \text{const}, \quad \sigma_1(t) = \sigma_1 = \text{const}.$$

For a stochastic starting value it holds that:

$$\mu_1(0) = \mathrm{E}(X(0)), \quad \mu_2(0) = \mathrm{E}(X^2(0)).$$

By plugging in, Proposition 12.3 yields

$$\mu_1(t) = e^{c_1 t}\left[\mu_1(0) + 0\right] = \mu_1(0)\,\exp(c_1 t).$$

With the definitions from Proposition 12.3 one determines

$$\gamma_1(t) = 2c_1 + \sigma_1^2 =: \gamma_1 \quad \text{and} \quad \gamma_2(t) = 0.$$

Hence, substitution yields

$$\mu_2(t) = e^{\gamma_1 t}\left[\mu_2(0) + 0\right] = \mu_2(0)\,\exp(\gamma_1 t).$$

Thus, the variance is calculated as

$$\mathrm{Var}(X(t)) = \mu_2(t) - \mu_1^2(t) = \mu_2(0)\,\exp\{(2c_1 + \sigma_1^2)\,t\} - \mu_1^2(0)\,\exp\{2c_1 t\}.$$

12.5 The equation at hand is linear with

$$c_1(t) = c_2(t) = 0, \quad \sigma_2(t) = 0.$$

Furthermore, it holds that

$$\sigma_1(t) = 1.$$

Therefore, the solution deduced from Proposition 12.2 reads:

$$X(t) = Z(t)\left[X(0) + 0\right]$$

with

$$Z(t) = \exp\left\{-\frac{1}{2}\,t + W(t)\right\}.$$

In particular for X(0) = 1 (analogously to $e^0 = 1$) it hence holds that:

$$X(t) = \exp\left\{W(t) - \frac{1}{2}\,t\right\}.$$

Due to the analogy to $de^t = e^t\,dt$ with $e^0 = 1$ this process X(t) is sometimes called
"Ito exponential". It is noteworthy that the Ito exponential is not given by $\exp\{W(t)\}$.

12.6 The equation is linear, see Eq. (12.3), and corresponds to the special case of
additive noise, cf. (12.12), i.e. $\sigma_1(t) = 0$. The remaining coefficients read:

$$c_1(t) = -\frac{1}{1+t}, \quad c_2(t) = 0, \quad \sigma_2(t) = \frac{1}{1+t}.$$

Hence, the expression for the solution from (12.13) yields with X(0) = 0:

$$X(t) = z(t)\int_0^t \frac{1}{(1+s)\,z(s)}\,dW(s),$$

where

$$z(t) = \exp\left\{-\int_0^t (1+s)^{-1}\,ds\right\} = \exp\left\{-\left[\log(1+s)\right]_0^t\right\} = \exp\{-\log(1+t) + 0\} = \exp\left\{\log\frac{1}{1+t}\right\} = \frac{1}{1+t}.$$

Since $\sigma_2(t)/z(t) = 1$, the solution simplifies radically:

$$X(t) = \frac{1}{1+t}\int_0^t dW(s) = \frac{W(t)}{1+t}.$$

For this solution it obviously holds that:

$$\mathrm{E}(X(t)) = 0, \quad \mathrm{Var}(X(t)) = \frac{t}{(1+t)^2} \to 0, \quad t \to \infty.$$

Thus, for $t \to \infty$ we have established

$$\mathrm{MSE}(X(t), 0) = \mathrm{E}\left[(X(t) - 0)^2\right] \to 0,$$

which just corresponds to the required convergence in mean square.
12.7 In order to prove the inequality claimed, we define the function

$$g(s) = \frac{1}{2}\,\frac{e^{2s}}{1+s}$$

with the derivative (quotient rule)

$$g'(s) = \frac{e^{2s}}{1+s} - \frac{1}{2}\,\frac{e^{2s}}{(1+s)^2}.$$

Let us call the integral of interest I,

$$I = \int_0^t \frac{e^{2s}}{1+s}\,ds.$$

Then it follows

$$I = \int_0^t g'(s)\,ds + \frac{1}{2}\int_0^t \frac{e^{2s}}{(1+s)^2}\,ds = g(t) - g(0) + \frac{1}{2}\int_0^t \frac{e^{2s}}{(1+s)^2}\,ds < g(t) - g(0) + \frac{1}{2}\,I,$$

where the bound follows from $(1+s) < (1+s)^2$ for $s > 0$. By rearranging terms it results
that

$$I < 2\left(g(t) - g(0)\right).$$

With the definition of g it follows

$$I < \frac{e^{2t}}{1+t} - 1,$$

which was to be shown.

12.8 This is again an inhomogeneous linear equation with additive noise:

$$c_1(t) = -\frac{1}{1-t}, \quad c_2(t) = 0, \quad \sigma_1(t) = 0, \quad \sigma_2(t) = 1.$$

With X(0) = 0, X(t) from (12.13) turns out to be:

$$X(t) = z(t)\int_0^t \left(z(s)\right)^{-1} dW(s)$$

with

$$z(t) = \exp\left\{-\int_0^t \frac{1}{1-s}\,ds\right\} = \exp\left\{\left[\log(1-s)\right]_0^t\right\} = 1 - t,$$

i.e.

$$X(t) = (1-t)\int_0^t \frac{1}{1-s}\,dW(s).$$

Due to $c_2(t) = X(0) = 0$, (12.14) yields:

$$\mathrm{E}(X(t)) = 0.$$

Due to (12.15), the variance is:

$$\mathrm{Var}(X(t)) = (1-t)^2\int_0^t \frac{1}{(1-s)^2}\,ds = (1-t)^2\left[\frac{1}{1-s}\right]_0^t = (1-t)^2\left(\frac{1}{1-t} - 1\right) = (1-t)^2\,\frac{t}{1-t} = (1-t)\,t.$$

For $t \to 1$ the variance shrinks to zero such that X(t) tends to 0 in mean
square:

$$\mathrm{MSE}(X(t), 0) = \mathrm{E}\left[(X(t) - 0)^2\right] = \mathrm{Var}(X(t)) \to 0.$$

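12.9 The key problem with this exercise is not to confuse the different meanings
of $\sigma_i(t)$, i = 1, 2, in Proposition 11.2 and Eq. (12.3). Hence, firstly we adapt
Proposition 11.2 for the processes X(t) from (12.3) and Z(t) from (12.7):

$$dX(t) = \mu_X(t)\,dt + \sigma_X(t)\,dW(t),$$
$$\mu_X(t) = c_1(t)\,X(t) + c_2(t), \quad \sigma_X(t) = \sigma_1(t)\,X(t) + \sigma_2(t),$$
$$dZ(t) = \mu_Z(t)\,dt + \sigma_Z(t)\,dW(t),$$
$$\mu_Z(t) = c_1(t)\,Z(t), \quad \sigma_Z(t) = \sigma_1(t)\,Z(t).$$

Following the hint, we consider

$$g(X, Z) = \frac{X}{Z} = X Z^{-1}$$

with

$$\frac{\partial g}{\partial X} = Z^{-1}, \quad \frac{\partial^2 g}{\partial X^2} = 0, \quad \frac{\partial g}{\partial Z} = -XZ^{-2}, \quad \frac{\partial^2 g}{\partial Z^2} = 2XZ^{-3}, \quad \frac{\partial^2 g}{\partial X\,\partial Z} = -Z^{-2}.$$

Hence, Ito's lemma (Proposition 11.2) yields:

$$d\left(\frac{X}{Z}\right) = Z^{-1}\,dX - XZ^{-2}\,dZ + \frac{1}{2}\left[0 + 2XZ^{-3}\sigma_Z^2\right] dt - Z^{-2}\sigma_X\sigma_Z\,dt$$
$$= Z^{-1}(c_1 X + c_2)\,dt + Z^{-1}(\sigma_1 X + \sigma_2)\,dW - XZ^{-2}c_1 Z\,dt - XZ^{-2}\sigma_1 Z\,dW + XZ^{-3}\sigma_1^2 Z^2\,dt - Z^{-2}(\sigma_1 X + \sigma_2)\,\sigma_1 Z\,dt$$
$$= \left(Z^{-1}c_2 - Z^{-1}\sigma_1\sigma_2\right) dt + Z^{-1}\sigma_2\,dW.$$

Integration yields:

$$\frac{X(t)}{Z(t)} = \frac{X(0)}{Z(0)} + \int_0^t \frac{c_2(s) - \sigma_1(s)\,\sigma_2(s)}{Z(s)}\,ds + \int_0^t \frac{\sigma_2(s)}{Z(s)}\,dW(s).$$

If this equation is multiplied by Z(t), then, due to Z(0) = 1, one obtains the desired
result.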

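12.10 We apply (11.5) with

$$X_1 = X \quad \text{and} \quad X_2 = Z,$$

where X and Z are driven by the same Wiener process, say $W_1 = W$. Then the
one-factor quotient rule is obtained by the following restrictions:

$$\sigma_{11} = \sigma_X \quad \text{and} \quad \sigma_{12} = 0,$$
$$\sigma_{21} = \sigma_Z \quad \text{and} \quad \sigma_{22} = 0.$$

For this purpose, $\sigma_X$ and $\sigma_Z$ were defined in the previous problem. Then, the
one-factor quotient rule yields:

$$d\left(\frac{X}{Z}\right) = \frac{Z\,dX - X\,dZ}{Z^2} + \frac{XZ^{-1}\sigma_Z^2 - \sigma_X\sigma_Z}{Z^2}\,dt$$
$$= \frac{Z(c_1 X + c_2)\,dt + Z(\sigma_1 X + \sigma_2)\,dW - Xc_1 Z\,dt - X\sigma_1 Z\,dW}{Z^2} + \frac{XZ^{-1}\sigma_1^2 Z^2 - (\sigma_1 X + \sigma_2)\,\sigma_1 Z}{Z^2}\,dt$$
$$= \frac{c_2 - \sigma_1\sigma_2}{Z}\,dt + \frac{\sigma_2}{Z}\,dW.$$

As before, we obtain the desired result by integration and multiplication by Z(t).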
13.1 Summary


The  results  from  the  previous  chapter  will  be  applied  to  stochastic  differential 
equations  that  were  suggested  in  the  literature  for  modeling  interest  rate  dynamics. 
However,  we  do  not  model  yield  curves  with  various  maturities,  but  consider  the 
model  for  one  interest  rate  only  driven  by  one  Wiener  process  (one-factor  model). 
The  next  section  starts  with  the  general  Ornstein-Uhlenbeck  process  which  has  the 
drawback  of  allowing  for  negative  values.  Subsequently,  we  discuss  linear  models 
for  which  negativity  is  ruled  out.  Finally,  a  class  of  nonlinear  models  will  be 
considered. 


13.2 Ornstein-Uhlenbeck Process (OUP)

We  have  already  encountered  the  standard  OUP  in  the  chapter  on  Stieltjes  integrals. 
Now,  we  discuss  the  general  case,  which  has  served  as  an  interest  rate  model  in  the 
literature  on  finance. 


Vasicek 


We now assume constant coefficients for the inhomogeneous linear SDE with
additive noise in (12.12):

$$c_1(t) = c_1 = \text{const}, \quad c_2(t) = c_2 = \text{const}, \quad \sigma_2(t) = \sigma_2 = \text{const}, \quad \sigma_1(t) = 0.$$

This defines the general Ornstein-Uhlenbeck process,

$$dX(t) = \left(c_1\,X(t) + c_2\right) dt + \sigma_2\,dW(t). \qquad (13.1)$$
According to Corollary 12.1 the solution is

$$X(t) = e^{c_1 t}\left[X(0) + \frac{c_2}{c_1}\left(1 - e^{-c_1 t}\right) + \int_0^t \sigma_2\,e^{-c_1 s}\,dW(s)\right]. \qquad (13.2)$$

In particular for $c_2 = 0$ and $\sigma_2 = 1$ we obtain the standard OUP with X(0) = 0
from (9.3) in Sect. 9.4. The following equation sheds additional light on the OUP:

$$dX(t) = c_1\left(X(t) - \mu\right) dt + \sigma_2\,dW(t), \quad \text{with } \mu := -\frac{c_2}{c_1}. \qquad (13.3)$$

In this manner, Vasicek (1977) modeled the interest rate dynamics, cf. (1.7). Due
to (13.2), the solution of this interest rate equation reads:

$$X(t) = e^{c_1 t}\left[X(0) + \mu\left(e^{-c_1 t} - 1\right) + \int_0^t \sigma_2\,e^{-c_1 s}\,dW(s)\right].$$

If the starting value is set to $\mu$, $X(0) = \mu$, then one obtains the form immediately
corresponding to the standard OUP (9.3),

$$X(t) = \mu + e^{c_1 t}\int_0^t \sigma_2\,e^{-c_1 s}\,dW(s),$$

with expectation $\mu$. From (12.14) and (12.15) we obtain for an arbitrary fixed
starting value X(0):

$$\mu_1(t) = \mathrm{E}(X(t)) = e^{c_1 t}X(0) + \mu\left(1 - e^{c_1 t}\right), \qquad (13.4)$$

$$\mathrm{Var}(X(t)) = \frac{\sigma_2^2}{-2c_1}\left(1 - e^{2c_1 t}\right). \qquad (13.5)$$

In particular, the mean value function $\mu_1(t)$ results as a convex combination of the
long-term mean value $\mu$ and the starting value X(0). For $c_1 < 0$, these moments tend
to a fixed value and the process can be understood as asymptotically stationary:

$$\mu_1(t) \to \mu \quad \text{for } c_1 < 0,$$

$$\mathrm{Var}(X(t)) \to \frac{\sigma_2^2}{-2c_1} \quad \text{for } c_1 < 0,$$

where the limits are taken as $t \to \infty$. Processes with this property are also called
"mean-reverting". The adjustment parameter $c_1 < 0$ measures the "speed of mean-reversion",
i.e. the strength of adjustment: The smaller (more negative) $c_1$, the more
strongly dX(t) reacts as a function of the deviation of X(t) from $\mu$. If $X(t) > \mu$, then
$c_1 < 0$ causes $c_1(X(t) - \mu)$ to have a negative impact on X(t) such that the process
tends to decrease and therefore approaches $\mu$; conversely, it holds that for $X(t) < \mu$
the process experiences a positive impulse. Furthermore, one observes the distinct
influence of the parameter $c_1$ on the (asymptotic) variance: The smaller the negative
$c_1$, the smaller is the asymptotic expression; however, for a negative $c_1$ near zero
the variance becomes large and the OUP loses the property of "mean reversion" for
$c_1 = 0$.
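The moment functions (13.4) and (13.5) and their limits are easy to tabulate. The sketch below is only an illustration: the function names `ou_mean` and `ou_var` are assumptions, and the parameter values mirror those quoted for the figures of this section ($\mu = 5$, $\sigma_2 = 0.01$, $c_1 = -0.9$ or $-0.1$).

```python
import math

def ou_mean(t, x0, c1, mu):
    """Mean function (13.4): weighted average of x0 and the long-term mean mu."""
    w = math.exp(c1 * t)
    return w * x0 + mu * (1.0 - w)

def ou_var(t, c1, sigma2):
    """Variance function (13.5) for a fixed starting value."""
    return sigma2 ** 2 / (-2.0 * c1) * (1.0 - math.exp(2.0 * c1 * t))

mu, sigma2, x0 = 5.0, 0.01, 5.1
for c1 in (-0.9, -0.1):
    limit_var = sigma2 ** 2 / (-2.0 * c1)  # asymptotic variance for c1 < 0
    for t in (0.0, 1.0, 5.0, 50.0):
        print(c1, t, ou_mean(t, x0, c1, mu), ou_var(t, c1, sigma2), limit_var)
```

The output shows the mean moving from X(0) towards mu and the variance approaching its limit, faster and to a smaller value for the more negative c1.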

Determining the autocovariance function for $c_1 < 0$ is also useful. We denote it
by $\gamma(t, t+h)$ at lag h:

$$\gamma(t, t+h) = \mathrm{E}\left[\left(X(t) - \mu_1(t)\right)\left(X(t+h) - \mu_1(t+h)\right)\right], \quad h \geq 0,$$
$$= \mathrm{E}\left[e^{c_1 t}\sigma_2\int_0^t e^{-c_1 s}\,dW(s)\; e^{c_1 (t+h)}\sigma_2\int_0^{t+h} e^{-c_1 s}\,dW(s)\right]$$
$$= \sigma_2^2\,\mathrm{E}\left[X_{c_1}(t)\,X_{c_1}(t+h)\right],$$

where $X_{c_1}(t)$ is the standard OUP with zero expectation. From Proposition 9.4 we
hence adopt

$$\gamma(t, t+h) = \sigma_2^2\,e^{c_1 h}\,\frac{e^{2c_1 t} - 1}{2c_1} \to \frac{\sigma_2^2\,e^{c_1 h}}{-2c_1}, \quad t \to \infty.$$

Thus, for large t there results an autocovariance function depending only on the
temporal distance h. All in all, this is why the OUP with $c_1 < 0$ can be labeled as
asymptotically ($t \to \infty$) weakly stationary.


Simulations 

In the following, processes with T = 20 and n = 1000 are simulated. For reasons
of graphical comparability, the same WP is always assumed, i.e. the 1000 random
variables filtered by the recursion (12.16) are always the same ones.

First, we examine the impact of the adjustment parameter $c_1$ on the behavior of
the OUP. In Fig. 13.1, $\sigma_2 = 0.01$. As we want to think about interest rates when
looking at this figure, expectation and starting value are chosen to be $\mu = 5$ (%).
Here, it is obvious that the solid line deviates less strongly from the expected value
for $c_1 = -0.9$ and is thus "more stationary" than the dashed graph for $c_1 = -0.1$.
This is evident as the smaller (i.e. the more negative) $c_1$, the smaller is the variance
$\sigma_2^2/(-2c_1)$ for growing t.

In the second figure, we have an OUP with the same parameter set-up for
$c_1 = -0.9$, however, with a starting value different from $\mu = 5$, X(0) = 5.1.
Furthermore, the expected value function $\mu_1(t)$ is given and one can observe how
rather rapidly it approaches the value $\mu = 5$ (Fig. 13.2).
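The set-up described for the simulations can be sketched as follows. This is an assumption-laden illustration and not the code used for the book's figures: the seed, the function name `ou_euler`, and the choice of Python's `random` module are made here, while T = 20, n = 1000, $\mu = 5$, $\sigma_2 = 0.01$ and the two values of $c_1$ are taken from the text. Both paths are driven by the same simulated increments, as the text requires.

```python
import math
import random

def ou_euler(c1, mu, sigma2, x0, shocks, dt):
    """Euler recursion (12.16) for dX = c1 (X - mu) dt + sigma2 dW.

    shocks holds the simulated increments W(t_i) - W(t_{i-1}).
    """
    path = [x0]
    for eps in shocks:
        x = path[-1]
        path.append(x + c1 * (x - mu) * dt + sigma2 * eps)
    return path

T, n = 20.0, 1000
dt = T / n
rng = random.Random(2024)  # hypothetical seed
shocks = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]

# The same shocks are reused so that the two paths are comparable.
fast = ou_euler(-0.9, 5.0, 0.01, 5.0, shocks, dt)
slow = ou_euler(-0.1, 5.0, 0.01, 5.0, shocks, dt)

print(max(abs(x - 5.0) for x in fast), max(abs(x - 5.0) for x in slow))
```

Plotting `fast` and `slow` against the grid $t_i = iT/n$ would give a picture of the kind described for Fig. 13.1; the printed maximal deviations depend on the particular draw of the shocks.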
Fig. 13.1 OUP for $c_1 = -0.9$ ($X_1$) and $c_1 = -0.1$ ($X_2$) ($X(0) = \mu = 5$, $\sigma_2 = 0.01$)


Despite the convenient property of mean reversion, the OUP is only partly
suitable for interest rate modeling: Note that the process takes on negative values
with a positive probability. This is due to the fact that the OUP, as a Stieltjes integral,
is Gaussian:

$$X(t) \sim N\left(\mu_1(t), \mathrm{Var}(X(t))\right).$$

Subsequent to Vasicek (1977), interest rate models without this drawback have been
discussed.

13.3 Positive Linear Interest Rate Models

For  now  we  stay  in  the  class  of  linear  SDEs,  however,  we  restrict  the  discussion  to 
the  case  in  which  positivity  (more  precisely:  nonnegativity)  is  guaranteed. 


Sufficient  Condition 

A sufficient condition for a positive evolution of the solution of a linear SDE is easy
to specify. For this purpose we naturally consider the general solution from
Proposition 12.2. Note that Z(t), as an exponential function, is always positive. With
the restriction $\sigma_2(t) = 0$ the following diffusion is obtained:

$$dX(t) = \left(c_1(t)\,X(t) + c_2(t)\right) dt + \sigma_1(t)\,X(t)\,dW(t). \qquad (13.6)$$

With a positive starting value and $c_2(t) \geq 0$, a positive evolution of X(t) is ensured,
since (12.9) then reduces to $X(t) = Z(t)\left[X(0) + \int_0^t c_2(s)/Z(s)\,ds\right]$ with $Z(t) > 0$.
The models in this section are of the form (13.6).

Fig. 13.2 OUP for $c_1 = -0.9$ and Starting Value X(0) = 5.1 including Expected Value Function
($\mu = 5$, $\sigma_2 = 0.01$)


Dothan 

Let us consider a special case more extensively. Dothan (1978) suggested for the
interest rate dynamics a special case of the geometric Brownian motion:

$$dX(t) = \sigma_1\,X(t)\,dW(t), \quad X(0) > 0.$$




With $c_2(t) = \sigma_2(t) = 0$ it holds in this case that

$$X(t) = X(0)\,\exp\left\{-\tfrac{1}{2}\sigma_1^2\, t + \sigma_1 W(t)\right\},$$

and hence, the interest rate $X(t)$ can in fact not become negative. It is not $X(t)$ but $\log(X(t))$ that follows a Gaussian distribution. Furthermore, in Example 12.3 we have determined the moments (for a fixed starting value):

$$\mu_1(t) = X(0) \quad \text{and} \quad \mathrm{Var}(X(t)) = X^2(0)\left(\exp(\sigma_1^2 t) - 1\right).$$

Thus,  the  variance  of  the  process  increases  exponentially  which  is  why  the  model 
may  not  be  satisfactory  for  interest  rates. 
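
The two moment formulas can be checked numerically. The following is a minimal sketch, not part of the original text: it draws $X(t)$ from the closed-form solution and compares sample mean and variance with the theoretical values. The function names, the seed and the parameter values ($X(0) = 5$, $\sigma_1 = 0.02$, $t = 20$) are illustrative assumptions.

```python
import math
import random


def dothan_moments(x0, sigma1, t):
    """Theoretical mean and variance of X(t) for dX = sigma1 * X dW."""
    mean = x0
    var = x0 ** 2 * (math.exp(sigma1 ** 2 * t) - 1.0)
    return mean, var


def dothan_sample(x0, sigma1, t, n_draws, seed=0):
    """Draw X(t) = x0 * exp(-0.5*sigma1^2*t + sigma1*W(t)), W(t) ~ N(0, t)."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    sd_w = math.sqrt(t)
    return [
        x0 * math.exp(-0.5 * sigma1 ** 2 * t + sigma1 * rng.gauss(0.0, sd_w))
        for _ in range(n_draws)
    ]


x0, sigma1, t = 5.0, 0.02, 20.0  # illustrative values
draws = dothan_sample(x0, sigma1, t, n_draws=100_000)
mc_mean = sum(draws) / len(draws)
mc_var = sum((x - mc_mean) ** 2 for x in draws) / (len(draws) - 1)
th_mean, th_var = dothan_moments(x0, sigma1, t)

print("mean:     theory", th_mean, "simulated", mc_mean)
print("variance: theory", th_var, "simulated", mc_var)
print("smallest draw:", min(draws))  # positive, since X(t) is an exponential
```

Because every draw is an exponential times a positive starting value, no simulated rate can be negative, while the theoretical variance keeps growing with $t$.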


Brennan-Schwartz 

Brennan  and  Schwartz  (1980)  suggested  another  attractive  variant.  It  consists  of  a 
combination  of  Vasicek  (1977)  and  Dothan  (1978);  we  choose  the  drift  component 
just  as  for  the  Ornstein-Uhlenbeck  process  and  the  volatility  just  as  for  the  geometric 
Brownian  motion: 

$$dX(t) = c_1\,(X(t) - \mu)\,dt + \sigma_1 X(t)\,dW(t), \quad X(0) = \mu > 0, \tag{13.7}$$

where, for simplicity, the starting value is set equal to $\mu$. For $c_1 < 0$ it holds that $c_2 = -c_1\mu > 0$ such that we have indeed a positive interest rate dynamics. For this model one can show (see Problem 13.4) that the expected value results just as for Dothan (1978),

$$\mu_1(t) = \mu = X(0),$$

while it holds for the variance:

$$\mathrm{Var}(X(t)) = \frac{\mu^2\sigma_1^2}{2c_1 + \sigma_1^2}\left(\exp\left((2c_1 + \sigma_1^2)\,t\right) - 1\right).$$

If $c_1 < -\sigma_1^2/2$ (i.e. $2c_1 + \sigma_1^2 < 0$), then it holds that the variance tends to a fixed positive value ($t \to \infty$):

$$\mathrm{Var}(X(t)) \to \frac{-\mu^2\sigma_1^2}{2c_1 + \sigma_1^2} \quad \text{for } c_1 < -\frac{\sigma_1^2}{2}.$$


If the volatility parameter $\sigma_1$ is relatively small compared to the absolute value of the negative adjustment parameter $c_1$, then the model (13.7) provides a process with a fixed expected value and an asymptotically constant variance. Again, one speaks of "mean reversion". Interestingly, the variance is not only influenced by $\sigma_1$ and $c_1$ in an obvious manner: The greater $\sigma_1$, the greater $\mathrm{Var}(X(t))$, and the greater $c_1$ in absolute value, the more strongly or the faster the adjustment happens (and the smaller is the variance). The parameter $\mu > 0$ from the drift function has a positive effect on the variance as well. Intuitively, this is obvious: The smaller $\mu$ (i.e. the closer to zero), the less $X(t)$ can spread as the process does not become negative; conversely, the scope for the variance between the zero line and $\mu$ increases as $\mu$ grows.
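
The comparative statics described here can be read off the variance formula directly. The sketch below is an illustration under assumed names and parameter values (taken from the figures of this section, $\mu = 5$, $\sigma_1 = 0.01$); it evaluates the Brennan-Schwartz variance for $X(0) = \mu$ and its limit.

```python
import math


def bs_variance(t, c1, sigma1, mu):
    """Var(X(t)) of the Brennan-Schwartz model (13.7) with X(0) = mu."""
    a = 2.0 * c1 + sigma1 ** 2
    if a == 0.0:
        # limit of (exp(a*t) - 1) / a for a -> 0
        return mu ** 2 * sigma1 ** 2 * t
    return mu ** 2 * sigma1 ** 2 / a * (math.exp(a * t) - 1.0)


def bs_asymptotic_variance(c1, sigma1, mu):
    """Limit of Var(X(t)) for t -> infinity; requires 2*c1 + sigma1^2 < 0."""
    a = 2.0 * c1 + sigma1 ** 2
    if a >= 0.0:
        raise ValueError("variance does not converge unless 2*c1 + sigma1^2 < 0")
    return -mu ** 2 * sigma1 ** 2 / a


for c1 in (-0.9, -0.1):
    print("c1 =", c1,
          "Var(X(20)) =", bs_variance(20.0, c1, 0.01, 5.0),
          "limit =", bs_asymptotic_variance(c1, 0.01, 5.0))
```

A stronger adjustment (larger $|c_1|$) gives the smaller limit, and doubling $\mu$ quadruples it.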


Simulations 

Again, processes with T = 20 and T = 1000 were simulated. For reasons of graphical comparability, the same WP as in the previous section is assumed, i.e. the 1000 random variables being filtered by the recursion (12.16) are identical.

In Fig. 13.3 it is obvious how the variance of the geometric Brownian motion suggested by Dothan (1978) increases with the parameter $\sigma_1$. Although the expected value is constant and equal to the starting value, long periods are possible and probable in which the process does not cross the expected value. A more plausible interest rate dynamics can be observed in Fig. 13.4 for two values of the adjustment parameter $c_1$. The values are chosen small enough (relative to $\sigma_1$), such that the variance remains bounded and converges to a fixed value. It is obvious: The greater $c_1$ in absolute value, the smaller the variance.

Fig. 13.3 Dothan for $\sigma_1 = 0.01$ ($X_1$) and $\sigma_1 = 0.02$ ($X_2$) ($X(0) = \mu = 5$)

Fig. 13.4 Brennan-Schwartz for $c_1 = -0.9$ ($X_1$) and $c_1 = -0.1$ ($X_2$) ($X(0) = \mu = 5$, $\sigma_1 = 0.01$)


13.4 Nonlinear Models

Chan, Karolyi, Longstaff, and Sanders (1992) [in short: CKLS] considered the following class of nonlinear equations for modeling short-term interest rates, which is covered in this section:

$$dX(t) = c_1\,(X(t) - \mu)\,dt + \sigma X^{\gamma}(t)\,dW(t), \quad \mu > 0, \quad 0 \le \gamma \le 1. \tag{13.8}$$

Thus, the modeling of the drift component always corresponds to the one by Vasicek (1977). The OUP from (13.1) just results for $\gamma = 0$ while $\gamma = 1$ leads to the just discussed process from (13.7). Noninteger values of $\gamma$ in between provide a nonlinear interest rate dynamics. The process from (13.8) is sometimes also called a model with constant elasticity, as it holds for the elasticity, formed with the derivative of the volatility $\sigma X^{\gamma}(t)$ with respect to $X$, that:

$$\frac{d(\sigma X^{\gamma})}{dX}\,\frac{X}{\sigma X^{\gamma}} = \gamma.$$

In order to show the interpretation of $\gamma$ as an elasticity, we consider a discretization of the CKLS process as for the computer simulation. For this purpose, we define for discrete steps of the length 1, $t = 1, 2, \ldots, T$:

$$x_t := X(t), \quad \varepsilon_t := \Delta W(t) = W(t) - W(t-1).$$

The discrete-time version of (13.8) hence reads

$$\Delta x_t = c_1\,(x_{t-1} - \mu) + \sigma x_{t-1}^{\gamma}\,\varepsilon_t, \quad \varepsilon_t \sim ii\mathcal{N}(0, 1),$$

or

$$x_t = x_{t-1} + c_1\,(x_{t-1} - \mu) + \sigma x_{t-1}^{\gamma}\,\varepsilon_t.$$

For the conditional variance it holds:

$$\mathrm{Var}(x_t \mid x_{t-1}) = \sigma^2 x_{t-1}^{2\gamma}.$$

Correspondingly, it holds for the conditional standard deviation that e.g. a doubling of $x_{t-1}$ leads to a multiplication by the factor $2^{\gamma}$:

$$\sqrt{\mathrm{Var}(x_t \mid \tilde{x}_{t-1})} = \sigma \tilde{x}_{t-1}^{\gamma} \quad \text{for } \tilde{x}_{t-1} = 2x_{t-1},$$

$$= \sigma\, 2^{\gamma} x_{t-1}^{\gamma}$$

$$= 2^{\gamma}\sqrt{\mathrm{Var}(x_t \mid x_{t-1})}.$$

Two simulated paths of the CKLS model are depicted in Fig. 13.5. They only differ in the elasticity $\gamma$. It is not surprising that the deviations from $\mu$ get greater with a greater $\gamma$.
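
The discrete-time recursion translates almost literally into code. The following is a minimal sketch, assuming unit time steps as in the discretization above; it is not the recursion used for the book's figures, and the function names, the seed and the guard against negative levels are assumptions made for illustration.

```python
import random


def cond_sd(x_prev, sigma, gamma):
    """Conditional standard deviation sigma * x_{t-1}^gamma."""
    return sigma * x_prev ** gamma


def ckls_path(c1, mu, sigma, gamma, x0, shocks):
    """Iterate x_t = x_{t-1} + c1*(x_{t-1} - mu) + sigma*x_{t-1}^gamma*eps_t."""
    path = [x0]
    for eps in shocks:
        x_prev = path[-1]
        level = max(x_prev, 0.0)  # guard (assumption): avoid powers of negatives
        path.append(x_prev + c1 * (x_prev - mu) + sigma * level ** gamma * eps)
    return path


def spread(path, mu):
    """Sum of squared deviations from mu, a crude measure of dispersion."""
    return sum((x - mu) ** 2 for x in path)


rng = random.Random(1)  # fixed seed (assumption)
shocks = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # same shocks for both paths

x_low = ckls_path(-0.9, 5.0, 0.01, 0.25, 5.0, shocks)
x_high = ckls_path(-0.9, 5.0, 0.01, 0.75, 5.0, shocks)

print("spread for gamma = 0.25:", spread(x_low, 5.0))
print("spread for gamma = 0.75:", spread(x_high, 5.0))
print("sd ratio after doubling the level:",
      cond_sd(6.0, 0.01, 0.75) / cond_sd(3.0, 0.01, 0.75), "vs", 2 ** 0.75)
```

Feeding both paths the same shocks isolates the effect of the elasticity, in the spirit of the comparison in the figure.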


Cox,  Ingersoll  &  Ross  [CIR] 

A particularly prominent representative of (13.8) is obtained for $\gamma = 0.5$. This model is often used following Cox, Ingersoll, and Ross (1985):

$$dX(t) = c_1\,(X(t) - \mu)\,dt + \sigma\sqrt{X(t)}\,dW(t), \quad \mu > 0, \quad c_1 < 0. \tag{13.9}$$

Fig. 13.5 CKLS with $\gamma = 0.25$ ($X_1$) and $\gamma = 0.75$ ($X_2$) for $c_1 = -0.9$ ($X(0) = \mu = 5$, $\sigma = 0.01$)


The  conditional  standard  deviation  is  modeled  as  a  square  root  which  is  why  one 
also  speaks  of  (13.9)  as  a  “square  root  process”.  Consequently,  the  conditional 
variance  of  the  increments  is  proportional  to  the  level  of  the  process. 

For this nonlinear SDE one can show formally what is also intuitive: If $X(t)$ (starting from a positive starting value $X(0) > 0$) takes on the value zero, then the variance is zero as well, but the change $dX(t)$ gets a positive impulse of the strength $-c_1\mu$ such that the process is reflected on the zero line for $\mu > 0$. In this respect the square root process overcomes the deficiency of the OUP as an interest rate model. However, an analytical representation of the solution of (13.9) is not known.

Already for the ordinary square root process from (13.9) with $\sigma(t, x) = \sigma\sqrt{x}$ the condition of existence (E1) from Proposition 12.1 is not fulfilled anymore as the derivative at zero does not exist. Fortunately, there are weaker conditions ensuring the existence of a solution of (13.9); however, they do not guarantee the finiteness of the first two moments anymore. In order to show that finite moments exist up to the second order, we would need more fundamental arguments. Instead, we start with calculating the moments (finiteness assumed).

For reasons of simplicity, assume a fixed starting value equal to $\mu$ in the following: $X(0) = \mu$. Then we obtain (see Problem 13.5), as for the OUP, on average

$$\mu_1(t) = \mathrm{E}(X(t)) = e^{c_1 t} X(0) + \mu\left(1 - e^{c_1 t}\right) = \mu.$$

Under the assumption on the starting value $X(0) = \mu$, for the second moment we obtain (cf. Problem 13.6)

$$\mu_2(t) = \mathrm{E}(X^2(t)) = \mu^2 + \frac{\sigma^2\mu}{-2c_1}\left(1 - e^{2c_1 t}\right),$$

from which it immediately follows for the variance

$$\mathrm{Var}(X(t)) = \frac{\sigma^2\mu}{-2c_1}\left(1 - e^{2c_1 t}\right).$$


The asymptotic variance for $t \to \infty$ hence coincides with the one of the OUP if $\mu = 1$; for $\mu < 1$ it turns out to be smaller (as the process is reflected on the zero line and therefore varies in a narrow band) while it is obviously greater for $\mu > 1$. The border case $\mu = 0$ makes sense as well: Here, the asymptotic variance is zero as, sooner or later, the process is absorbed by the zero line.

For Fig. 13.6 an OUP with $c_1 = -0.9$ and $\sigma_2 = 0.01$ was simulated but the expected value of 5 % is now written as 0.05. In the example it becomes clear that the OUP can definitely become negative. In comparison, we observe a numerical solution of the corresponding square root process from (13.9) with the same volatility parameter and the same drift component. The picture confirms the theoretical considerations: The process exhibits a smaller variance and does not become negative.

Fig. 13.6 OUP and CIR for $c_1 = -0.9$ ($X(0) = \mu = 0.05$, $\sigma = \sigma_2 = 0.01$)
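
The comparison of the two variances can be made concrete with a few lines of code. This is only a sketch under the assumptions of this section (fixed starting value, and $X(0) = \mu$ for the square root process); the function names are hypothetical, and the parameter values mirror those quoted for Fig. 13.6.

```python
import math


def oup_variance(t, c1, sigma):
    """Var(X(t)) of the OUP with fixed starting value and c1 < 0."""
    return sigma ** 2 / (-2.0 * c1) * (1.0 - math.exp(2.0 * c1 * t))


def cir_variance(t, c1, sigma, mu):
    """Var(X(t)) of the square root process (13.9) with X(0) = mu."""
    return mu * oup_variance(t, c1, sigma)


c1, sigma = -0.9, 0.01
for mu in (0.0, 0.05, 1.0, 5.0):
    print("mu =", mu,
          "CIR variance at t = 20:", cir_variance(20.0, c1, sigma, mu),
          "OUP variance at t = 20:", oup_variance(20.0, c1, sigma))
```

The two variance functions differ only by the factor $\mu$, which is why they coincide for $\mu = 1$ and why the square root process is less dispersed for $\mu = 0.05$.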


Further  Models  and  Parameter  Estimation 

Marsh and Rosenfeld (1983) mention the variant with $\mu = 0$ as a borderline case of (13.8). Cox, Ingersoll, and Ross (1980) consider a version with $\gamma > 1$ for a special investigation:

$$dX(t) = \sigma X^{3/2}(t)\,dW(t).$$

Finally, some models are applied which leave the framework of CKLS from (13.8) entirely, e.g. Constantinides and Ingersoll (1984) with

$$dX(t) = c\,X^2(t)\,dt + \sigma X^{3/2}(t)\,dW(t),$$

where  both  drift  and  volatility  are  nonlinear. 

Given the copious possibilities for specifying a diffusion process, it is not surprising that attempts have been made, first, to estimate unknown parameters and, second, to discriminate statistically between the different model classes. Besides the work by CKLS, the papers by Broze, Scaillet, and Zakoïan (1995) and Tse (1995) should be mentioned. As a first introduction to the topic of estimation of diffusion parameters, the corresponding chapter by Gouriéroux and Jasiak (2001) is recommended.


13.5 Problems and Solutions

Problems 

13.1  Derive  the  solution  (13.2)  of  Eq.  (13.1). 

13.2  Derive  the  moments,  (13.4)  and  (13.5),  of  the  OUP  (a  fixed  starting  value  X(0) 
assumed). 

13.3 Discuss Eq. (13.1) for $c_1 = 0$ as a special case of the OUP (a proposal by Merton, 1973). For this purpose, consider the solution, the expected value and the variance for $c_1 \to 0$, if necessary with L'Hospital's rule. (You should be familiar with the results. By which name do you know the process as well?)

13.4 Consider now, as a combination of the interest models by Vasicek (1977) and Dothan (1978), the process from (13.7) by Brennan and Schwartz (1980),

$$dX(t) = c_1\,(X(t) - \mu)\,dt + \sigma_1 X(t)\,dW(t), \quad \mu = X(0) > 0,$$

particularly with the starting value $X(0) = \mu$. Determine expectation and variance. How do these behave for $t \to \infty$ if it holds that $2c_1 < -\sigma_1^2$?

13.5 Consider the square root process (13.9) by Cox et al. (1985). Under the assumption $X(0) = \mu$ for the starting value, derive an expression for the expected value.

13.6 Again, consider the square root process (13.9) by Cox et al. (1985). Under the assumption $X(0) = \mu$ for the starting value, derive an expression for the variance.


Solutions 


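13.1 Equation (13.1) is a special case of (12.12) with constant coefficients. Hence, the solution results from (12.13) with

$$z(t) = \exp\left\{\int_0^t c_1\,ds\right\} = e^{c_1 t}.$$

From this it follows

$$\int_0^t \frac{c_2(s)}{z(s)}\,ds = c_2\int_0^t e^{-c_1 s}\,ds = -\frac{c_2}{c_1}\,e^{-c_1 s}\Big|_0^t = -\frac{c_2}{c_1}\left(e^{-c_1 t} - 1\right).$$

Thus, from (12.13) we obtain the desired result:

$$X(t) = e^{c_1 t}\left[X(0) - \frac{c_2}{c_1}\left(e^{-c_1 t} - 1\right) + \int_0^t \sigma_2\,e^{-c_1 s}\,dW(s)\right].$$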


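13.2 The expected value function is determined from (12.14):

$$\mu_1(t) = e^{c_1 t}\left[X(0) + \int_0^t c_2\,e^{-c_1 s}\,ds\right] = e^{c_1 t}\left[X(0) - \frac{c_2}{c_1}\left(e^{-c_1 t} - 1\right)\right].$$

With $\mu = -c_2/c_1$ the formula from (13.4) results.

The variance expression follows from (12.15):

$$\mathrm{Var}(X(t)) = \left(e^{c_1 t}\right)^2\int_0^t\left(\frac{\sigma_2}{e^{c_1 s}}\right)^2 ds = e^{2c_1 t}\,\sigma_2^2\int_0^t e^{-2c_1 s}\,ds$$

$$= e^{2c_1 t}\,\sigma_2^2\,\frac{e^{-2c_1 s}}{-2c_1}\Big|_0^t = \frac{\sigma_2^2}{2c_1}\,e^{2c_1 t}\left[1 - e^{-2c_1 t}\right] = \frac{\sigma_2^2}{-2c_1}\left(1 - e^{2c_1 t}\right).$$

This expression coincides with (13.5).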
13.3 L'Hospital's rule provides for $c_1 \to 0$:

$$\lim_{c_1 \to 0}\frac{e^{-c_1 t} - 1}{c_1} = \lim_{c_1 \to 0}\frac{-t\,e^{-c_1 t}}{1} = -t.$$

Hence, $X(t)$ from (13.2) merges for $c_1 \to 0$ into

$$X(0) + c_2 t + \int_0^t \sigma_2\,dW(s) = X(0) + c_2 t + \sigma_2 W(t).$$

Equation (13.4) is not suitable for determining the expected value as $\mu = -c_2/c_1$ is not defined for $c_1 \to 0$. Instead, one directly obtains (for $X(0)$ fixed):

$$\mu_1(t) = \mathrm{E}(X(t)) = X(0) + c_2 t + 0.$$

This linear growth is distinctive for the Brownian motion with drift, cf. Chap. 7.

In order to determine the variance from (13.5), it is again argued with L'Hospital's rule:

$$\lim_{c_1 \to 0}\frac{1 - e^{2c_1 t}}{2c_1} = \lim_{c_1 \to 0}\frac{-2t\,e^{2c_1 t}}{2} = -t.$$

Therefore it holds that

$$\lim_{c_1 \to 0}\mathrm{Var}(X(t)) = \sigma_2^2\,t = \mathrm{Var}(\sigma_2 W(t)).$$

This is the familiar variance of a Brownian motion with drift. Indeed, a Brownian motion with drift is the same as the process resulting from the OUP for $c_1 = 0$ or $c_1 \to 0$.

13.4 This equation is linear and does not belong to the category "additive noise". It is rather a special case of (12.3) with

$$c_1(t) = c_1, \quad c_2(t) = -\mu c_1, \quad \sigma_1(t) = \sigma_1, \quad \sigma_2(t) = 0.$$

With

$$z(t) = e^{c_1 t},$$

due to Proposition 12.3 the expected value is given by

$$\mathrm{E}(X(t)) = e^{c_1 t}\left[\mathrm{E}(X(0)) - \mu c_1\int_0^t e^{-c_1 s}\,ds\right]$$

$$= e^{c_1 t}\left[\mathrm{E}(X(0)) + \mu e^{-c_1 t} - \mu\right]$$

$$= \mu + e^{c_1 t}\left[\mathrm{E}(X(0)) - \mu\right].$$

Hence, for $c_1 < 0$ it holds:

$$\mathrm{E}(X(t)) \to \mu, \quad t \to \infty.$$

For the second moment, we determine from (12.11):

$$\mu_2(t) = \exp\{(2c_1 + \sigma_1^2)t\}\left[\mu_2(0) - \int_0^t \frac{2c_1\mu\,\mu_1(s)}{\exp\{(2c_1 + \sigma_1^2)s\}}\,ds\right].$$

In particular for $X(0) = \mu$, this simplifies, due to $\mathrm{E}(X(t)) = \mu$, to (with $\mu_2(0) = \mu^2$):

$$\mu_2(t) = \exp\{(2c_1 + \sigma_1^2)t\}\left[\mu^2 - 2c_1\mu^2\int_0^t \exp\{-(2c_1 + \sigma_1^2)s\}\,ds\right]$$

$$= \exp\{(2c_1 + \sigma_1^2)t\}\left[\mu^2 + 2c_1\mu^2\,\frac{\exp\{-(2c_1 + \sigma_1^2)s\}}{2c_1 + \sigma_1^2}\Big|_0^t\right]$$

$$= \exp\{(2c_1 + \sigma_1^2)t\}\left[\mu^2 + \frac{2c_1\mu^2}{2c_1 + \sigma_1^2}\left(\exp\{-(2c_1 + \sigma_1^2)t\} - 1\right)\right]$$

$$= \frac{2c_1\mu^2}{2c_1 + \sigma_1^2} + \frac{\sigma_1^2\mu^2\exp\{(2c_1 + \sigma_1^2)t\}}{2c_1 + \sigma_1^2}.$$

Thus, the variance for $X(0) = \mu$ reads:

$$\mathrm{Var}(X(t)) = \mu_2(t) - \mu^2$$

$$= \frac{\sigma_1^2\mu^2\exp\{(2c_1 + \sigma_1^2)t\}}{2c_1 + \sigma_1^2} + \frac{2c_1\mu^2}{2c_1 + \sigma_1^2} - \frac{(2c_1 + \sigma_1^2)\mu^2}{2c_1 + \sigma_1^2}$$

$$= \frac{\sigma_1^2\mu^2\exp\{(2c_1 + \sigma_1^2)t\}}{2c_1 + \sigma_1^2} - \frac{\sigma_1^2\mu^2}{2c_1 + \sigma_1^2}$$

$$= \frac{\sigma_1^2\mu^2}{2c_1 + \sigma_1^2}\left(\exp\{(2c_1 + \sigma_1^2)t\} - 1\right).$$

If $2c_1 < -\sigma_1^2$, then it hence holds

$$\mathrm{Var}(X(t)) \to \frac{-\sigma_1^2\mu^2}{2c_1 + \sigma_1^2} > 0,$$

as $t \to \infty$.

13.5 In order to determine the expected value function, we write Eq. (13.9) in integral form:

$$X(t) = X(0) + \int_0^t c_1\,(X(s) - \mu)\,ds + \sigma\int_0^t \sqrt{X(s)}\,dW(s).$$

By assumption, $\mu_1(0) = \mathrm{E}(X(0)) = \mathrm{E}(\mu) = \mu$. Thus, Propositions 8.2 and 10.3(b) yield:

$$\mu_1(t) = \mu + \int_0^t c_1\,(\mu_1(s) - \mu)\,ds + 0$$

or rather

$$d\mu_1(t) = c_1\,(\mu_1(t) - \mu)\,dt.$$

Due to (12.5), the solution of this deterministic differential equation reads:

$$\mu_1(t) = z(t)\left[\mu_1(0) + \int_0^t \frac{-c_1\mu}{z(s)}\,ds\right]$$

$$= e^{c_1 t}\left[\mu - \mu c_1\int_0^t e^{-c_1 s}\,ds\right]$$

$$= e^{c_1 t}\left[\mu + \mu\,e^{-c_1 s}\Big|_0^t\right]$$

$$= \mu\,e^{c_1 t}\left[1 + e^{-c_1 t} - 1\right]$$

$$= \mu.$$

13.6 In order to determine the function of the second moment analogously to the expected value in Problem 13.5, we search for an integral equation for $X^2(t)$. This is provided by Ito's lemma for $g(X) = X^2$ with $g'(X) = 2X$ and $g''(X) = 2$:

$$dX^2(t) = 2X(t)\,dX(t) + \tfrac{1}{2}\,2\,\sigma^2(t)\,dt$$

$$= 2X(t)\,dX(t) + \sigma^2 X(t)\,dt,$$

where $\sigma(t) = \sigma\sqrt{X(t)}$ from (13.9) was substituted. Plugging in the definition of $dX(t)$ further provides:

$$dX^2(t) = \left(2c_1 X^2(t) - 2\mu c_1 X(t) + \sigma^2 X(t)\right)dt + 2\sigma X(t)\sqrt{X(t)}\,dW(t),$$

or rather

$$X^2(t) = X^2(0) + \int_0^t\left(2c_1 X^2(s) + (\sigma^2 - 2\mu c_1)X(s)\right)ds + 2\sigma\int_0^t X(s)\sqrt{X(s)}\,dW(s).$$

With $X(0) = \mu$ forming expectation yields from Propositions 8.2 and 10.3(b):

$$\mu_2(t) = \mu^2 + \int_0^t\left(2c_1\mu_2(s) + (\sigma^2 - 2\mu c_1)\mu\right)ds + 0,$$

as $\mu_1(s) = \mu$ is constant. Thus, for $\mu_2(t)$ a deterministic differential equation results:

$$d\mu_2(t) = \left(2c_1\mu_2(t) + \sigma^2\mu - 2\mu^2 c_1\right)dt.$$

With

$$z(t) = \exp\left\{\int_0^t 2c_1\,ds\right\} = e^{2c_1 t}$$

the solution reads

$$\mu_2(t) = e^{2c_1 t}\left[\mu^2 + \int_0^t \frac{\sigma^2\mu - 2\mu^2 c_1}{e^{2c_1 s}}\,ds\right]$$

$$= e^{2c_1 t}\left[\mu^2 - \frac{\sigma^2\mu - 2\mu^2 c_1}{2c_1}\,e^{-2c_1 s}\Big|_0^t\right]$$

$$= \frac{e^{2c_1 t}}{2c_1}\left[2c_1\mu^2 - (\sigma^2\mu - 2\mu^2 c_1)\left(e^{-2c_1 t} - 1\right)\right]$$

$$= \frac{e^{2c_1 t}}{2c_1}\left[\sigma^2\mu - (\sigma^2\mu - 2\mu^2 c_1)\,e^{-2c_1 t}\right]$$

$$= \frac{\sigma^2\mu}{2c_1}\,e^{2c_1 t} + \mu^2 - \frac{\sigma^2\mu}{2c_1}.$$

Thus, in a last step the required variance is calculated as

$$\mathrm{Var}(X(t)) = \mu_2(t) - \mu_1^2(t) = \frac{\sigma^2\mu}{2c_1}\left(e^{2c_1 t} - 1\right).$$

14 Asymptotics of Integrated Processes


14.1  Summary 


This  chapter  aims  at  providing  the  basics  in  order  to  understand  the  asymptotic 
distributions  of  modern  time  series  econometrics.  In  the  first  section,  we  treat  the 
mathematical  problems  of  a  functional  limit  theory  as  solved  and  get  to  know  the 
basic  ingredients  of  a  functional  limit  theory.  Then,  we  proceed  somewhat  more 
abstractly  by  presenting  the  mathematical  hurdles  to  be  overcome  in  order  to  arrive 
at  a  functional  limit  theory.  Finally,  we  consider  multivariate  generalizations. 


14.2 Limiting Distributions of Integrated Processes

Under  classical  assumptions  it  holds  that  the  arithmetic  mean  of  a  sample  converges 
to  the  expected  value  of  the  sample  variables  for  the  growing  sample  size.  However, 
if  the  sample  is  generated  by  a  random  walk,  then  this  does  not  hold  any  longer. 
Hence,  limits  and  limiting  distributions  for  so-called  integrated  processes  will  now 
be  discussed. 

Long-Run  Variance 

In  order  to  technically  formulate  the  concept  of  an  integrated  process,  we  need 
the  so-called  long-run  variance.  However,  this  involves  an  old  acquaintance  from 
Chap.  4. 


1 For a review on the convergence of random sequences, we recommend Pötscher and Prucha (2001).

© Springer International Publishing Switzerland 2016
U. Hassler, Stochastic Processes and Calculus, Springer Texts in Business and Economics, DOI 10.1007/978-3-319-23428-1_14

Let $\{e_t\}$ denote a stationary discrete-time process with zero expectation and the autocovariances

$$\gamma_e(h) = \mathrm{Cov}(e_t, e_{t+h}) = \mathrm{E}(e_t e_{t+h}), \quad \mathrm{E}(e_t) = 0.$$

As long-run variance we define

$$\omega_e^2 = \gamma_e(0) + 2\sum_{h=1}^{\infty}\gamma_e(h) < \infty. \tag{14.1}$$

Here, we rule out the case of a fractionally integrated process with long memory, I(d) with $d > 0$, as introduced in Chap. 5, since we require the autocovariances to be summable: $\omega_e^2 < \infty$. In the case of a pure random process or white noise, $e_t = \varepsilon_t$, variance and long-run variance naturally coincide:

$$\omega_e^2 = \sigma^2 = \gamma_\varepsilon(0) \quad \text{if } \gamma_e(h) = 0, \; h \neq 0.$$

In general, it holds that the long-run variance is a multiple of the spectrum at the frequency zero, see (4.3):

$$\omega_e^2 = 2\pi f_e(0).$$

Now, we further assume an MA($\infty$) process for $\{e_t\}$, see (3.2):

$$e_t = \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j}, \quad c_0 = 1, \quad t = 1, \ldots, n, \tag{14.2}$$

with absolutely summable coefficients:

$$\sum_{j=0}^{\infty}|c_j| < \infty. \tag{14.3}$$

In order to determine the long-run variance, we establish an alternative expression for MA($\infty$) processes in Problem 14.1:

$$\omega_e^2 = \sigma^2\left(\sum_{j=0}^{\infty} c_j\right)^2. \tag{14.4}$$

Due to $2\pi f_e(0) = \omega_e^2$, one can directly read this relation from Eq. (4.5), too.

Example 14.1 (Long-run Variance of MA(1)) We consider a moving average process of order 1,

$$e_t = \varepsilon_t + b\,\varepsilon_{t-1}.$$

From Example 4.3 we adopt the spectrum,

$$f(\lambda) = \left(1 + b^2 + 2b\cos(\lambda)\right)\sigma^2/2\pi.$$

Thus, at the origin the long-run variance results,

$$\omega_e^2 = 2\pi f(0) = (1 + b)^2\sigma^2.$$

At the minimum, this expression takes on the value zero which happens for $b = -1$:

$$e_t = \varepsilon_t - \varepsilon_{t-1} = \Delta\varepsilon_t.$$

In this case $\{e_t\}$ is "overdifferenced". What is meant by this will be explained in the following. ■
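
For a finite moving average the two expressions for the long-run variance, the sum of autocovariances and $\sigma^2$ times the squared coefficient sum, can be compared directly. The following sketch uses hypothetical helper names and treats a finite MA(q) as an MA($\infty$) whose remaining coefficients are zero.

```python
def ma_autocov(coeffs, sigma2, h):
    """Autocovariance gamma_e(h) of e_t = sum_j c_j * eps_{t-j}, h >= 0."""
    return sigma2 * sum(coeffs[j] * coeffs[j + h] for j in range(len(coeffs) - h))


def long_run_variance(coeffs, sigma2):
    """gamma_e(0) + 2 * sum_{h >= 1} gamma_e(h); terms beyond lag q vanish."""
    q = len(coeffs) - 1
    return ma_autocov(coeffs, sigma2, 0) + 2.0 * sum(
        ma_autocov(coeffs, sigma2, h) for h in range(1, q + 1)
    )


sigma2 = 2.0
for b in (0.5, 0.0, -1.0):
    coeffs = [1.0, b]  # MA(1): c_0 = 1, c_1 = b
    print("b =", b,
          "sum of autocovariances:", long_run_variance(coeffs, sigma2),
          "(1 + b)^2 * sigma^2:", (1.0 + b) ** 2 * sigma2)
```

For $b = -1$ both expressions give zero, which is the overdifferenced case of the example.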

Integrated  Processes 

We revisit Example 14.1 and consider the differences of a stationary MA($\infty$) process $\{e_t\}$:

$$\Delta e_t = e_t - e_{t-1} = \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j} - \sum_{j=0}^{\infty} c_j\,\varepsilon_{t-j-1}$$

$$= c_0\,\varepsilon_t + \sum_{j=1}^{\infty}(c_j - c_{j-1})\,\varepsilon_{t-j}.$$

Therefore, $\{\Delta e_t\}$ is also a stationary process where the coefficients are now called $\{d_j\}$:

$$\Delta e_t = \sum_{j=0}^{\infty} d_j\,\varepsilon_{t-j}, \quad d_0 = c_0 = 1, \quad d_j = c_j - c_{j-1}.$$

By definition, it hence holds

$$\sum_{j=0}^{\infty} d_j = c_0 + (c_1 - c_0) + (c_2 - c_1) + \cdots = 0.$$

Thus, applying (14.4) to $\{\Delta e_t\}$, we obtain for the long-run variance:

$$\omega_{\Delta e}^2 = \sigma^2\left(\sum_{j=0}^{\infty} d_j\right)^2 = 0.$$

The process $\{\Delta e_t\}$ is overdifferenced: It is differenced more often than necessary for stationarity as $\{e_t\}$ itself is already stationary. Overdifferencing is reflected in the long-run variance being zero.

If the stationary, absolutely summable process $\{e_t\}$ has a positive long-run variance, then we call it integrated of order zero (in symbols: $e_t \sim I(0)$):

$$e_t \sim I(0) \iff 0 < \omega_e^2 < \infty.$$

Verbally, this means: We have to difference zero times in order to attain stationarity, and it is not differenced once more than needed. Technically, in terms of Chap. 5, this means that the process is fractionally integrated of order $d = 0$.

Finally, the process $\{x_t\}$ with

$$x_t = \sum_{j=1}^{t} e_j, \quad e_t \sim I(0), \quad t = 1, \ldots, n,$$

is called integrated of order one, I(1), as it is defined as sum ("integral") of an I(0) process. The random walk from (1.8) e.g. is integrated of order one and obviously nonstationary. It holds for I(1) random walks that differencing once,

$$\Delta x_t = e_t,$$

is required by definition to obtain stationarity. Hence, I(1) processes are sometimes called difference-stationary. Also, I(1) processes are often labelled as unit root processes, or are said to have an autoregressive unit root. We briefly want to elaborate on this terminology. Assume that $\{e_t\}$ is a stationary autoregressive process of order $p$,

$$A_e(L)\,e_t = \varepsilon_t \quad \text{with} \quad A_e(L) = 1 - a_1 L - \cdots - a_p L^p.$$

Consequently, the I(1) process is autoregressive of order $p + 1$ since

$$A_e(L)\,\Delta x_t = \varepsilon_t \quad \text{or} \quad A_x(L)\,x_t = \varepsilon_t,$$

where

$$A_x(L) = A_e(L)\,(1 - L)$$

$$= 1 - (a_1 + 1)L - (a_2 - a_1)L^2 - \cdots - (a_p - a_{p-1})L^p + a_p L^{p+1}.$$

2 In econometrics, this overdifferencing is also described by the fact that $\{\Delta e_t\}$ is integrated of order $-1$, $\Delta e_t \sim I(-1)$: $\{e_t\}$ is differenced one more time although the process is already stationary.

Hence, $A_x(z)$ has a unit root, meaning that $A_x(z) = 0$ has a solution on the unit circle, namely the real root $z = 1$: $A_x(1) = 0$.
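
The unit root can be verified mechanically by multiplying the lag polynomials. The sketch below is an illustration with an assumed stationary AR(2) example ($a_1 = 0.5$, $a_2 = 0.3$); polynomials are stored as coefficient lists in ascending powers of $L$, and the helper names are hypothetical.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (ascending powers)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def poly_eval(p, z):
    """Evaluate the polynomial p at the point z."""
    return sum(coef * z ** k for k, coef in enumerate(p))


a = [0.5, 0.3]                    # assumed AR(2) coefficients a_1, a_2
A_e = [1.0] + [-ak for ak in a]   # A_e(L) = 1 - 0.5 L - 0.3 L^2
A_x = poly_mul(A_e, [1.0, -1.0])  # A_x(L) = A_e(L) * (1 - L)

print("A_x coefficients:", A_x)
print("A_e(1) =", poly_eval(A_e, 1.0))
print("A_x(1) =", poly_eval(A_x, 1.0))
```

The product has degree $p + 1 = 3$, and evaluating it at $z = 1$ gives zero up to rounding, whereas $A_e(1)$ is different from zero.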


The  Functional  Central  Limit  Theorem  (FCLT) 

Now, we make statements on the distribution of the stochastic step function (the partial sum process)

$$X_n(s) = \frac{n^{-0.5}}{\omega_e}\sum_{j=1}^{\lfloor sn\rfloor} e_j, \quad s \in [0, 1]. \tag{14.5}$$

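Here, $\lfloor y\rfloor$ denotes the integer part of a real number $y$. In order to be able to divide by $\omega_e$, the process $\{e_t\}$ has to be I(0). In (7.1), we have already considered a precursor of this step function as it holds:

$$X_n(s) = \frac{n^{-0.5}}{\omega_e}\sum_{j=1}^{i-1} e_j \quad \text{for } s \in \left[\frac{i-1}{n}, \frac{i}{n}\right), \quad i = 1, \ldots, n.$$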


For a graphical illustration, recall Fig. 7.1. The following proposition for MA($\infty$) processes from (14.2) holds under some additional assumptions.

Proposition 14.1 (FCLT) Let $\{e_t\}$ from (14.2) be integrated of order zero and satisfy some additional assumptions. Then it holds for $X_n(s)$ from (14.5) that

$$X_n(s) = \frac{n^{-0.5}}{\omega_e}\sum_{j=1}^{\lfloor sn\rfloor} e_j \Rightarrow W(s), \quad s \in [0, 1], \quad n \to \infty,$$

where $\omega_e > 0$ is from (14.1).

Note that the FCLT, so to speak, consists of infinitely many central limit theorems. For a fixed $s$ it namely holds that

$$X_n(s) = \frac{n^{-0.5}}{\omega_e}\sum_{j=1}^{\lfloor sn\rfloor} e_j \xrightarrow{d} W(s) \sim \mathcal{N}(0, s).$$

3 Phillips and Solo (1992) assume that the innovations $\{\varepsilon_t\}$ form an iid sequence and that $\sum_{j=0}^{\infty} j\,|c_j| < \infty$, which is more restrictive than (14.3). Phillips (1987) or Phillips and Perron (1988) do without the iid assumption, but require more technical restrictions. For a discussion of further sets of assumptions ensuring Proposition 14.1 see also Davidson (1994).

However, as this holds for each $s \in [0, 1]$, we have, so to speak, uncountably many central limit theorems collected in Proposition 14.1. The mathematically precise collection is the so-called weak convergence in function spaces which is symbolized by "$\Rightarrow$". For the following, an intuitive notion thereof suffices; somewhat more rigorous remarks will be given in the next section.

As the I(1) process $\{x_t\}$ was defined as the sum of the past of $\{e_t\}$, the circumstance from the proposition can also be expressed as follows: It holds for $x_t = \sum_{j=1}^{t} e_j$ that

$$\frac{n^{-0.5}}{\omega_e}\,x_{\lfloor sn\rfloor} \Rightarrow W(s), \quad s \in [0, 1].$$

The first FCLT was proved by Donsker for pure random processes (Donsker, 1951). Particularly for an iid sequence $e_t = \varepsilon_t$, one hence speaks of Donsker's theorem. Frequently, an FCLT also operates under the name "invariance principle" as it is invariant with respect to the distribution of $\{e_t\}$.
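
A small Monte Carlo experiment illustrates the pointwise statement $X_n(s) \to \mathcal{N}(0, s)$ and the invariance with respect to the innovation distribution. The following is only a sketch: the MA(1) design, the uniform innovations, the sample size, the number of replications and the seed are all assumptions chosen for illustration.

```python
import math
import random


def simulate_ma1(n, b, rng):
    """e_t = eps_t + b*eps_{t-1} with uniform (non-Gaussian) eps_t, variance 1."""
    half_width = math.sqrt(3.0)  # U(-sqrt(3), sqrt(3)) has variance 1
    eps = [rng.uniform(-half_width, half_width) for _ in range(n + 1)]
    return [eps[t] + b * eps[t - 1] for t in range(1, n + 1)]


def partial_sum_process(e, omega, s):
    """X_n(s) = n^{-0.5} / omega * sum_{j <= floor(s*n)} e_j."""
    n = len(e)
    k = math.floor(s * n)
    return sum(e[:k]) / (omega * math.sqrt(n))


def sample_var(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(42)      # fixed seed (assumption)
n, b, reps = 400, 0.5, 2000
omega = abs(1.0 + b)         # long-run standard deviation for sigma = 1

end_vals, mid_vals = [], []
for _ in range(reps):
    e = simulate_ma1(n, b, rng)
    end_vals.append(partial_sum_process(e, omega, 1.0))
    mid_vals.append(partial_sum_process(e, omega, 0.5))

print("sample variance of X_n(1):  ", sample_var(end_vals), "(limit: 1)")
print("sample variance of X_n(0.5):", sample_var(mid_vals), "(limit: 0.5)")
```

Scaling by the long-run standard deviation rather than by the standard deviation of $e_t$ is what brings the variances close to $s$.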


First  Implications 

The following proposition assembles some implications being of immediate relevance in application. As an exercise, we encourage the reader to come up with the proof in order to understand why which powers of the sample size $n$ appear in the normalization of the sums; see Problems 14.3 through 14.5. Note that the weak convergence, "$\Rightarrow$", will be discussed in the next section, while "$\xrightarrow{d}$" stands for the usual convergence in distribution.

Proposition 14.2 (Some Limiting Distributions) Let $x_t = x_{t-1} + e_t$ with $x_0 = 0$, $t = 1, \ldots, n$, i.e.

$$x_t = \sum_{j=1}^{t} e_j,$$

where $\{e_t\}$ is I(0) as in Proposition 14.1. Then it holds for $n \to \infty$:

(a) $n^{-3/2}\sum_{t=1}^{n} x_{t-1} \xrightarrow{d} \omega_e\int_0^1 W(s)\,ds$,

(b) $n^{-3/2}\sum_{t=1}^{n} t\,e_t \xrightarrow{d} \omega_e\int_0^1 s\,dW(s)$,

(c) $n^{-5/2}\sum_{t=1}^{n} t\,x_{t-1} \xrightarrow{d} \omega_e\int_0^1 s\,W(s)\,ds$,

(d) $n^{-0.5}\sum_{t=1}^{\lfloor sn\rfloor}(e_t - \bar{e}) \Rightarrow \omega_e\left(W(s) - s\,W(1)\right)$, $\bar{e} = \frac{1}{n}\sum_{t=1}^{n} e_t$,

(e) $n^{-2}\sum_{t=1}^{n} x_{t-1}^2 \xrightarrow{d} \omega_e^2\int_0^1 W^2(s)\,ds$,

(f) $n^{-1}\sum_{t=1}^{n} x_{t-1}\,e_t \xrightarrow{d} \frac{\omega_e^2}{2}\left(W^2(1) - \frac{\gamma_e(0)}{\omega_e^2}\right) = \omega_e^2\left\{\int_0^1 W(s)\,dW(s) + \frac{\omega_e^2 - \gamma_e(0)}{2\omega_e^2}\right\}$,

with $\gamma_e(0) = \mathrm{Var}(e_t)$ and $\omega_e^2$ from (14.1).

Two remarks on the functional form of the statements shall be given.

Remark 1 Note the elegant and evocative functional analogy of the sums on the left-hand side and the integrals on the right-hand side in (a)-(c) and (e): Here, the sums are substituted by integrals, the Wiener process corresponds to the I(1) process $\{x_t\}$, and the increments of the WP $dW$ correspond to the I(0) increments $\Delta x_t = e_t$.

Remark 2 If $e_t = \varepsilon_t$ is white noise, then Ito's lemma in form of (10.3) also yields an accordance of the functional form of sample variables and limiting distributions in (f):

$$n^{-1}\sum_{t=1}^{n} x_{t-1}\,\varepsilon_t \xrightarrow{d} \frac{\sigma^2}{2}\left(W^2(1) - 1\right) = \sigma^2\int_0^1 W(s)\,dW(s).$$

As well, the limiting process appearing in (d) is intuitively well justified. It is a Brownian bridge with $W(1) - 1 \cdot W(1) = 0$ which just reflects

$$\sum_{t=1}^{n}(e_t - \bar{e}) = 0$$

for $s = 1$.
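
Behind the limit in (f) stands an exact finite-sample identity: squaring $x_t = x_{t-1} + e_t$ and summing over $t$ gives $\sum_{t=1}^{n} x_{t-1}e_t = \frac{1}{2}\left(x_n^2 - \sum_{t=1}^{n} e_t^2\right)$, the discrete counterpart of Ito's lemma. The sketch below checks this identity and the vanishing sum of the demeaned increments on one simulated sample; the seed, sample size and variable names are illustrative assumptions.

```python
import random


def cumulate(e):
    """Return [x_0, x_1, ..., x_n] with x_0 = 0 and x_t = x_{t-1} + e_t."""
    x = [0.0]
    for value in e:
        x.append(x[-1] + value)
    return x


rng = random.Random(7)  # fixed seed (assumption)
n = 500
e = [rng.gauss(0.0, 1.0) for _ in range(n)]  # e[t-1] holds e_t
x = cumulate(e)

# Left-hand side: sum of x_{t-1} * e_t for t = 1, ..., n
lhs = sum(x[t - 1] * e[t - 1] for t in range(1, n + 1))
# Right-hand side: 0.5 * (x_n^2 - sum of e_t^2)
rhs = 0.5 * (x[n] ** 2 - sum(value * value for value in e))

e_bar = sum(e) / n
centered_sum = sum(value - e_bar for value in e)

print("sum x_{t-1} e_t:          ", lhs)
print("0.5 * (x_n^2 - sum e_t^2):", rhs)
print("sum of (e_t - e_bar):     ", centered_sum)
```

Dividing the identity by $n$ makes the two limit terms visible: $x_n^2/n$ behaves like $\omega_e^2 W^2(1)$, and the average of $e_t^2$ converges to $\gamma_e(0)$.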

Example 14.2 (Demeaned WP) From Proposition 14.2 (a) results due to $x_0 = 0$:

$$n^{-0.5}\,\bar{x} = n^{-3/2}\sum_{t=1}^{n} x_t = n^{-3/2}\left(\sum_{t=1}^{n} x_{t-1} + x_n\right) = n^{-3/2}\sum_{t=1}^{n} x_{t-1} + n^{-3/2}\sum_{j=1}^{n} e_j$$

$$\xrightarrow{d} \omega_e\int_0^1 W(s)\,ds + 0.$$

Hence, the following FCLT holds for $\{x_t\}$ after demeaning:

$$\frac{x_{\lfloor sn\rfloor} - \bar{x}}{\omega_e\sqrt{n}} \Rightarrow W(s) - \int_0^1 W(r)\,dr,$$

where

$$\underline{W}(s) := W(s) - \int_0^1 W(r)\,dr$$

is also called a demeaned Wiener process. ■

In the example it was argued that it is negligible for the asymptotics whether we sum over $x_{t-1}$ or $x_t$. This holds in Proposition 14.2 (a), (c), (e) but not in (f), where on the right-hand side the sign in front of $\gamma_e(0)$ changes. The following corollary summarizes the corresponding results, cf. as well Problem 14.2.

Corollary 14.1 (Some Limiting Distributions) Let $x_t = x_{t-1} + e_t$ with $x_0 = 0$, $t = 1, \ldots, n$, i.e.

$$x_t = \sum_{j=1}^{t} e_j,$$

where $\{e_t\}$ is I(0) as in Proposition 14.1. Then it holds for $n \to \infty$:

$$n^{-3/2}\sum_{t=1}^{n} x_t \xrightarrow{d} \omega_e\int_0^1 W(s)\,ds,$$

$$n^{-5/2}\sum_{t=1}^{n} t\,x_t \xrightarrow{d} \omega_e\int_0^1 s\,W(s)\,ds,$$

$$n^{-2}\sum_{t=1}^{n} x_t^2 \xrightarrow{d} \omega_e^2\int_0^1 W^2(s)\,ds,$$

$$n^{-1}\sum_{t=1}^{n} x_t\,e_t \xrightarrow{d} \frac{\omega_e^2}{2}\left(W^2(1) + \frac{\gamma_e(0)}{\omega_e^2}\right) = \omega_e^2\left\{\int_0^1 W(s)\,dW(s) + \frac{\omega_e^2 + \gamma_e(0)}{2\omega_e^2}\right\},$$

with $\gamma_e(0) = \mathrm{Var}(e_t)$ and $\omega_e^2$ from (14.1).


14.3 Weak Convergence of Functions

In this section we briefly turn to the mathematical concepts behind Proposition 14.1. More rigorous expositions addressing an econometric audience can be found in Davidson (1994) or White (2001); see also the classical mathematical reference by Billingsley (1968).

Metric Function Spaces


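Recall the stochastic step function that has led us to the WP, see (7.1):

\[ X_n(t) = \begin{cases}
   \dfrac{1}{\sigma\sqrt{n}} \sum_{j=1}^{i-1} \varepsilon_j, & t \in \left[\dfrac{i-1}{n}, \dfrac{i}{n}\right), \\[2ex]
   \dfrac{1}{\sigma\sqrt{n}} \sum_{j=1}^{n} \varepsilon_j, & t = 1,
   \end{cases} \qquad i = 1, 2, \ldots, n, \]

which can also be written more compactly as

\[ X_n(t) = \frac{n^{-0.5}}{\sigma} \sum_{j=1}^{\lfloor nt \rfloor} \varepsilon_j, \quad t \in [0, 1]. \]

Furthermore, we define $\tilde{X}_n(t)$ as the function that coincides with $X_n(t)$ at the lower
endpoint of each interval. However, it is not constant on the intervals, but varies
linearly:

\[ \tilde{X}_n(t) = \frac{n^{-0.5}}{\sigma} \sum_{j=1}^{\lfloor nt \rfloor} \varepsilon_j
   + (nt - \lfloor nt \rfloor)\, \frac{\varepsilon_{\lfloor nt \rfloor + 1}}{\sigma\sqrt{n}}, \quad t \in [0, 1]. \]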


By construction, $\tilde{X}_n(t)$ is a continuous function on [0, 1], for which we also
write

\[ \tilde{X}_n \in C[0, 1]. \]

In contrast, $X_n(t)$ is only right-continuous and exhibits discontinuities
of the first type (i.e. jump discontinuities). It belongs to the set of so-called cadlag
functions, which is denoted by $D[0, 1]$ due to the discontinuities:

\[ X_n \in D[0, 1]. \]
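The two constructions can be made concrete with a short Python sketch. It is only an illustration under assumptions: the function names `step_process` and `interpolated_process` and the four innovations in `eps` are hypothetical and not taken from the text.

```python
import math

def step_process(eps, sigma, t):
    """Cadlag step function: n^{-1/2}/sigma times the sum of the first floor(n*t) innovations."""
    n = len(eps)
    k = math.floor(n * t)              # number of innovations summed up at time t
    return sum(eps[:k]) / (sigma * math.sqrt(n))

def interpolated_process(eps, sigma, t):
    """Continuous version: step value plus the linear interpolation term."""
    n = len(eps)
    k = math.floor(n * t)
    value = step_process(eps, sigma, t)
    if k < n:                          # at t = 1 there is no further innovation to add
        value += (n * t - k) * eps[k] / (sigma * math.sqrt(n))
    return value

eps = [1.0, -2.0, 0.5, 1.5]            # hypothetical innovations, n = 4
sigma = 1.0
for t in (0.25, 0.375, 0.5, 1.0):
    print(t, step_process(eps, sigma, t), interpolated_process(eps, sigma, t))
```

Both functions agree at the grid points $t = i/n$; between grid points the step function stays constant while the interpolated version moves linearly towards the next value.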

Obviously, the set of continuous functions is a subset of the cadlag functions, i.e.
$C[0, 1] \subset D[0, 1]$. Now, we want $X_n(t)$ as well as $\tilde{X}_n(t)$ to converge to a WP $W(t)$.
For this purpose we need a distance measure in function spaces, a metric $d$. A
precise mathematical definition follows.

Metric space: Let $M$ be an arbitrary set and $d$ a metric, i.e. a mapping,

\[ d: M \times M \to \mathbb{R}^+, \]

which assigns to $x$ and $y$ from $M$ a non-negative number such that the following
three conditions are satisfied:

\[ d(x, y) = 0 \iff x = y, \]

\[ d(x, y) = d(y, x) \quad \text{(symmetry)}, \]

\[ d(x, y) \le d(x, z) + d(z, y) \quad \text{(triangle inequality)}. \]

Then, $M$ endowed with $d$ is called a metric space, $(M, d)$.

4 This French acronym (sometimes also “càdlàg”) stands for “continue à droite, (avec une) limite à
gauche”: right-continuous with limits on the left.

Example 14.3 (Supremum Metric) In particular, $C[0, 1]$ and $D[0, 1]$ are readily
endowed with the supremum metric:

\[ d_s(f, g) := \sup_{0 \le t \le 1} |f(t) - g(t)|, \quad f, g \in D[0, 1]. \]

In Problem 14.6 it is shown that the above-mentioned three defining properties are
indeed fulfilled. ■
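A rough numerical illustration of the supremum metric follows. It is a sketch under assumptions: the supremum is only approximated on an equidistant grid, and the name `sup_metric` as well as the three functions `f`, `g`, `h` are hypothetical choices.

```python
def sup_metric(f, g, grid_size=1000):
    """Approximate d_s(f, g) = sup over [0, 1] of |f(t) - g(t)| on an equidistant grid."""
    return max(abs(f(i / grid_size) - g(i / grid_size)) for i in range(grid_size + 1))

f = lambda t: t          # hypothetical example functions on [0, 1]
g = lambda t: t * t
h = lambda t: 0.5

print(sup_metric(f, g))                                          # distance between f and g
print(sup_metric(f, g) <= sup_metric(f, h) + sup_metric(h, g))   # triangle inequality
```

For these smooth functions the grid approximation is harmless; for cadlag functions with jumps the grid would have to contain the jump points.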

However,  as  Xn(t)  and  W(t)  are  stochastic  functions,  a  convergence  of  {Xn}  to 
W  cannot  simply  be  based  on  ds(Xn,  W).  The  convergence  of  {Xn}  to  W  has  to  be 
formulated  rather  as  a  statement  on  probabilities  or  expected  values.  In  order  to 
specify  this,  we  need  the  concept  of  continuous  functionals. 


Continuous  Functionals 

Let the mapping $h$ assign a real number to the function $f \in D[0, 1]$,

\[ h: D[0, 1] \to \mathbb{R}. \]

As the argument of $h$ is a function, one often speaks of a functional.

Now, let the set of cadlag functions be equipped with a metric $d$, i.e. let
$(D[0, 1], d)$ be a metric space. Then the functional $h$ with $h: D[0, 1] \to \mathbb{R}$ is called
continuous with respect to $d$ if it holds for all $f, g \in D[0, 1]$ that

\[ |h(f) - h(g)| \to 0 \quad \text{for } d(f, g) \to 0. \]

An alternative definition of continuity reads: $h$ is called continuous with respect to
$d$ if for each $\varepsilon > 0$ there exists a $\delta > 0$ with

\[ |h(f) - h(g)| < \varepsilon \quad \text{for } d(f, g) < \delta. \]

Strictly  speaking,  continuity  is  a  “pointwise”  property;  however,  if  a  functional 
is  continuous  for  every  considered  function,  then  one  generally  speaks  of  continuity 
of  the  functional.  The  integral  over  a  function  is  a  typical  example  for  a  continuous 
functional. 

Example 14.4 (Three Functionals) Frequently, we encounter the following functionals
in econometrics:

\[ h_1(f) = \int_0^1 f(t)\,dt, \qquad h_2(f) = \int_0^1 f^2(t)\,dt, \qquad h_3(f) = \frac{1}{\int_0^1 f^2(t)\,dt}. \]

It can be shown that they are continuous on $D[0, 1]$ with respect to the supremum
metric (cf. Problem 14.7). ■
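The continuity of these functionals can be illustrated numerically. The following sketch rests on assumptions: integrals are approximated by a midpoint rule, the supremum by a grid maximum, and the functions `f` and `g` (a small perturbation of `f`) are hypothetical examples.

```python
def integrate(func, steps=10000):
    """Midpoint rule for the integral of func over [0, 1]."""
    return sum(func((i + 0.5) / steps) for i in range(steps)) / steps

def h1(f):
    return integrate(f)

def h2(f):
    return integrate(lambda t: f(t) ** 2)

def h3(f):
    return 1.0 / h2(f)

def sup_metric(f, g, grid_size=1000):
    """Grid approximation of the supremum metric on [0, 1]."""
    return max(abs(f(i / grid_size) - g(i / grid_size)) for i in range(grid_size + 1))

delta = 0.01                                    # hypothetical size of the perturbation
f = lambda t: 1.0 + t
g = lambda t: 1.0 + t + delta * (1.0 - 2.0 * t)

d = sup_metric(f, g)
print(d, abs(h1(f) - h1(g)), abs(h2(f) - h2(g)), abs(h3(f) - h3(g)))
```

A small supremum distance between `f` and `g` goes along with small differences in all three functionals, in line with the bounds derived in Problem 14.7.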


Weak  Convergence 

We consider a set of stochastic elements, be they random variables or stochastic
functions. Let $M$ be a set of stochastic elements and $d$ a metric. We define somewhat
loosely, see Billingsley (1968, Thm. 2.1): A sequence $S_n \in M$, $n \in \mathbb{N}$, converges
weakly to $S \in M$ for $n \to \infty$ if

\[ \lim_{n \to \infty} E(h(S_n)) = E(h(S)) \]

for all real-valued mappings $h$ that are bounded and uniformly continuous with
respect to $d$. Symbolically, we write

\[ S_n \Rightarrow S. \]


This definition in terms of expected values is not very illustrative as it is hard to
imagine all mappings which are bounded and continuous. In order to translate weak
convergence into a probability statement, we consider the indicator function $I_a$ for
an arbitrary real $a$ and $x \in \mathbb{R}$:

\[ I_a(x) := I_{(-\infty, a]}(x) = \begin{cases} 1, & x \le a \\ 0, & x > a \end{cases}. \]

By linearization on $[a, a + \varepsilon]$ for an arbitrarily small $\varepsilon > 0$, the indicator function
can be continuously approximated by

\[ \tilde{I}_a(x) = \begin{cases} 1, & x \le a \\ 1 - \dfrac{x - a}{\varepsilon}, & a < x < a + \varepsilon \\ 0, & x \ge a + \varepsilon \end{cases}. \]

The approximation can become arbitrarily close to $I_a$ for small $\varepsilon$. Let us now choose
$M = D[0, 1]$. Then it holds for the stochastic cadlag processes $X_n(t)$ and $X(t)$ that

\[ P(X_n(t) \le a) = E[I_a(X_n(t))] \approx E[\tilde{I}_a(X_n(t))], \]

\[ P(X(t) \le a) = E[I_a(X(t))] \approx E[\tilde{I}_a(X(t))]. \]

Hence, it holds for the continuous bounded functional $h = \tilde{I}_a$ for an arbitrary $a \in \mathbb{R}$
in case of weak convergence of $\{X_n(t)\}$ to $X(t)$, i.e. for $E[\tilde{I}_a(X_n(t))] \to E[\tilde{I}_a(X(t))]$,
that

\[ P(X_n(t) \le a) \approx P(X(t) \le a). \]

Hence, we have the following illustration of $\{X_n(t)\}$ converging weakly to $X(t)$: For
every point in time $t$ it holds that the sequence of distribution functions, $P(X_n(t) \le a)$,
tends to the distribution function of $X(t)$.
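The continuous approximation of the indicator function can be written down directly. This is a minimal sketch; the function names `indicator` and `smoothed_indicator` and the values of `a` and `eps_width` are hypothetical.

```python
def indicator(x, a):
    """Indicator of the half-line (-infinity, a]."""
    return 1.0 if x <= a else 0.0

def smoothed_indicator(x, a, eps_width):
    """Piecewise linear approximation: 1 up to a, linear decline on (a, a + eps_width), 0 afterwards."""
    if x <= a:
        return 1.0
    if x >= a + eps_width:
        return 0.0
    return 1.0 - (x - a) / eps_width

a, eps_width = 0.0, 0.1
for x in (-0.5, 0.0, 0.05, 0.1, 0.5):
    print(x, indicator(x, a), smoothed_indicator(x, a, eps_width))
```

The smoothed version is sandwiched between the indicators with thresholds `a` and `a + eps_width`, which is why the two expected values in the text are close for small widths.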

If $M$ in particular denotes the set of real random variables and if $X_n \Rightarrow X$ holds,
then the same argument shows for the distribution function that:

\[ F_n(a) := P(X_n \le a) \to P(X \le a) =: F(a). \]

With the definition from the end of Chap. 8, weak convergence of random variables
hence implies their convergence in distribution, $X_n \xrightarrow{d} X$. The converse holds as
well: For random variables $\{X_n\}$ and $X$, weak convergence is synonymous with
convergence in distribution.


Continuous  Mapping  Theorem 

A further ingredient of the proof of statements such as those in Proposition 14.2 is
the continuous mapping theorem (actually: about mappings which are discontinuous
only on “infinitesimal sets”); see Billingsley (1968, Thm. 5.1), Davidson
(1994), and White (2001). We consider two versions of the proposition which are
both special cases of a more general formulation.

Proposition 14.3 (Continuous Mapping Theorem (CMT)) For continuous mappings
of convergent sequences it holds:

(a) Let $\{X_n\}$ be a sequence of real random variables and $h: \mathbb{R} \to \mathbb{R}$ a continuous
function. From $X_n \xrightarrow{d} X$ for $n \to \infty$ it follows that

\[ h(X_n) \xrightarrow{d} h(X). \]

(b) Let $\{X_n(s)\}$ and $X(s)$ belong to $D[0, 1]$ and let $h: D[0, 1] \to \mathbb{R}$ be a continuous
functional. From $X_n(s) \Rightarrow X(s)$ for $n \to \infty$ it follows that

\[ h(X_n(s)) \xrightarrow{d} h(X(s)). \]

In words, the continuous mapping theorem means that mapping and limits can be
interchanged without altering the result: It does not matter whether $h$ is applied first
and then $n$ tends to infinity, or whether $n \to \infty$ first is followed by the mapping. At
first sight, this may seem trivial, which it is definitely not, see Example 14.5.

Remember that for $X = c = \text{const}$ convergence in distribution is equivalent
to convergence in probability, see Sect. 8.4. Hence, the CMT holds as well for
convergence in probability to a constant: From

\[ X_n \xrightarrow{p} c \]

for $n \to \infty$ it follows that $h(X_n)$ tends in probability to the corresponding constant:

\[ h(X_n) \xrightarrow{p} h(c). \]

In the literature, this fact is also known as Slutsky’s theorem. For this, we consider
an example.

Example 14.5 (Consistency of Moment Estimators) Let $\{y_t\}$ with $y_t = \mu + \varepsilon_t$ be a
white noise process with expected value $\mu$. For the arithmetic mean of a sample of
size $n$ we know from Example 8.4 (law of large numbers):

\[ \bar{y}_n = \frac{1}{n} \sum_{t=1}^{n} y_t \xrightarrow{p} \mu, \]

i.e. the empirical mean is a consistent estimator for the theoretical mean. Frequently,
however, one is interested in a parameter which is a function of $\mu$:

\[ \theta = h(\mu). \]

An estimator for $\theta$ constructed according to the method of moments is simply based
on the substitution of the unknown expected value by its consistent estimator:

\[ \hat{\theta}_n = h(\bar{y}_n). \]

Slutsky’s theorem as a special case of (a) from Proposition 14.3 then guarantees the
consistency of the moment estimator, provided $h$ is continuous:

\[ \hat{\theta}_n \xrightarrow{p} h(\mu) = \theta. \]

Such an interchangeability of some operation and a mapping is by no means trivial.
It does not hold, e.g., for the expectation in general: For non-linear functions $h$ one
has:

\[ E(\hat{\theta}_n) = E(h(\bar{y}_n)) \ne h(E(\bar{y}_n)) = h(\mu). \]

If e.g. $\{y_t\}$ is exponentially distributed with the parameter $\lambda$, i.e.

\[ P(y_t \le y) = 1 - e^{-\lambda y}, \quad \lambda > 0, \ y \ge 0, \]

then it holds that

\[ \mu = \frac{1}{\lambda}, \quad \text{and} \quad \lambda = h(\mu) = \frac{1}{\mu}. \]

The function $h$ is continuous in $\mu > 0$, which is why the moment estimator for $\lambda$ is
consistent:

\[ \hat{\lambda}_n = \frac{1}{\bar{y}_n} \xrightarrow{p} \frac{1}{\mu} = \lambda. \]

However, one can show that it holds for an iid sample that

\[ E(\hat{\lambda}_n) = \frac{n}{n - 1}\, \lambda, \]

which is why the estimator is not unbiased for $\lambda$ in finite samples. ■
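The finite-sample bias of the moment estimator can be checked by a small Monte Carlo experiment. The sketch below is an illustration under assumptions: the function name, the parameter values (`lam = 2`, `n = 5`), the number of replications and the seed are all hypothetical choices.

```python
import random

def mean_of_moment_estimates(lam, n, reps, seed=0):
    """Average of 1 / (sample mean) over many iid exponential samples of size n."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    total = 0.0
    for _ in range(reps):
        sample_mean = sum(rng.expovariate(lam) for _ in range(n)) / n
        total += 1.0 / sample_mean      # moment estimator of lambda
    return total / reps

lam, n = 2.0, 5
simulated = mean_of_moment_estimates(lam, n, reps=100_000)
theoretical = n / (n - 1) * lam         # expectation stated in Example 14.5
print(simulated, theoretical)
```

With a small sample size the average estimate lies visibly above the true parameter, while the factor $n/(n-1)$ vanishes as $n$ grows, in line with consistency.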


In order to justify the limit theory from the first section (i.e. in order to prove
something like Proposition 14.1), mathematicians have followed two paths. The first
is the treatment of $\tilde{X}_n(t) \in C[0, 1]$ with the ordinary supremum metric. For the proof
of econometric propositions such as Proposition 14.2 this has the disadvantage that
the impractical “continuity appendage”,

\[ \tilde{X}_n(t) - X_n(t) = (nt - \lfloor nt \rfloor)\, \frac{\varepsilon_{\lfloor nt \rfloor + 1}}{\sigma \sqrt{n}}, \]

needs to be dragged along, cf. e.g. Tanaka (1996). The second is the treatment of the more
compact cadlag functions $X_n(t)$ which, however, requires a more complicated metric
(Skorohod metric) and additional considerations. These mathematical difficulties
have indeed been solved and need not concern us, see e.g. Billingsley (1968) or
Davidson (1994). Hence, we always work with $\{X_n(t)\}$ in this book.


14.4 Multivariate Limit Theory

The  multivariate  limit  theory  is  based,  among  others,  on  a  vector  variant  of 
Proposition  14.1  and  yields  generalizations  of  Proposition  14.2  or  Corollary  14.1. 
For  the  sake  of  simplicity,  we  narrow  the  exposition  down  to  the  bivariate  case.  The 
following  elements  of  a  functional  limit  theory  are  kept  sufficiently  general  to  cover 
the  case  of  cointegration  as  well  as  the  case  of  no  cointegration  in  the  following  two 
chapters.  Here,  we  deviate  from  the  convention  of  Sect.  11.4  and  do  not  use  bold 
letters  to  denote  vectors  or  matrices. 


Integrated  Vectors 

The transposition of the vector $z_t$ is denoted by $z_t'$. Let $z_t' = (z_{1,t}, z_{2,t})$ be a bivariate
I(1) vector with starting value zero (we assume this for convenience); this means
both components are I(1). Then it holds for the differences by definition,

\[ \Delta z_t = z_t - z_{t-1} =: w_t = (w_{1,t}, w_{2,t})', \]

that they are stationary with expectation zero, more precisely: integrated of order
zero. In generalization of the univariate autocovariance function we define

\[ \Gamma_w(h) = E(w_t w_{t+h}') =
   \begin{pmatrix} E(w_{1,t} w_{1,t+h}) & E(w_{1,t} w_{2,t+h}) \\ E(w_{2,t} w_{1,t+h}) & E(w_{2,t} w_{2,t+h}) \end{pmatrix}. \]

Note that these matrices are not symmetric in $h$. Rather it holds that

\[ \Gamma_w(-h) = \Gamma_w'(h). \]


The long-run variance matrix is defined as a generalization of (14.1),

\[ \Omega_w = \sum_{h=-\infty}^{\infty} \Gamma_w(h)
   = \begin{pmatrix} \omega_1^2 & \omega_{12} \\ \omega_{12} & \omega_2^2 \end{pmatrix},
   \qquad \omega_i^2 > 0, \ i = 1, 2. \tag{14.6} \]

This matrix is symmetric ($\Omega_w = \Omega_w'$) and positive semi-definite ($\Omega_w \ge 0$) by
construction; sometimes we omit the subscript and write $\Omega$ instead of $\Omega_w$. Note that
$\Omega$ cannot be equal to the zero matrix as $\{w_t\}$ is not “overdifferenced” but I(0).
Nevertheless, the matrix does not have to be invertible. In the following chapter we will
learn that the presence of so-called cointegration depends on the rank of the matrix.

Now, let $W(t)$ denote a vector of length 2, namely the bivariate standard
Wiener process. Its components are stochastically independent such that this vector
is bivariate Gaussian with the identity matrix $I_2$:

\[ W(t) = (W_1(t), W_2(t))' \sim \mathcal{N}_2(0, t\, I_2). \]

The corresponding Brownian motion is defined as a vector as follows:

\[ B(t) = (B_1(t), B_2(t))' = \Omega^{0.5}\, W(t) \]

with

\[ B(t) \sim \mathcal{N}_2(0, t\, \Omega), \qquad \Omega = \Omega^{0.5} (\Omega^{0.5})'. \]

For the existence and construction of a matrix $\Omega^{0.5}$ with the given properties, which
is to some extent a “square root of a matrix”, we refer to the literature, e.g. Dhrymes
(2000, Def. 2.35). However, concrete numerical examples are provided here.


Example 14.6 ($\Omega = \Omega^{0.5} (\Omega^{0.5})'$)


First  consider  a  matrix  of  rank  1 , 


Q 


l 


Now,  let  us  define 


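with $P_1 P_1' = \Omega_1$.

Multiplied by itself, $P_1$ just yields the starting matrix $\Omega_1$. In the second example of
a diagonal matrix with full rank,

\[ \Omega_2 = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}, \]

the construction of the square root becomes even more obvious:

\[ P_2 = \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix} = P_2' \quad \text{with } P_2 P_2' = \Omega_2. \]

As $\Omega_2$ is a diagonal matrix, $P_2 = \Omega_2^{0.5}$ is just generated by taking the square root of
the diagonal. Let us consider a third example with full rank:

\[ \Omega_3 = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}. \]

Here it is not obvious which form $\Omega_3^{0.5}$ may have. However, one can check that

\[ P_3 = \frac{1}{2} \begin{pmatrix} \sqrt{3} + 1 & \sqrt{3} - 1 \\ \sqrt{3} - 1 & \sqrt{3} + 1 \end{pmatrix}
   = P_3' \quad \text{with } P_3 P_3' = \Omega_3. \]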


In the case where $\Omega$ has full rank it is actually easy to come up with one specific
factorization. Under full rank, $\Omega$ has a strictly positive determinant such that


h\ 


>  0. 


One  may  hence  define  the  following  triangular  matrix  factorizing  Q  from  (14.6), 


with $T\,T' = \begin{pmatrix} \omega_1^2 & \omega_{12} \\ \omega_{12} & \omega_2^2 \end{pmatrix}$, (14.7)

which is sometimes called the Cholesky decomposition of $\Omega$. For $\Omega_3$ one obtains that
way


1 

7! 


We hence reinforce the postulate that $\Omega^{0.5}$ of a matrix $\Omega$ is not unique. In particular,
$T_3$ is not symmetric, while $P_3$ is. Still, it holds $T_3 T_3' = P_3 P_3' = \Omega_3$. ■
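The non-uniqueness of the matrix square root can be verified numerically for the third example. The sketch below is an illustration under assumptions: it takes $\Omega_3$ with diagonal elements 2 and off-diagonal elements 1 (the matrix implied by $P_3 P_3'$), uses a lower triangular Cholesky factor as one possible triangular convention (the text's $T$ may be arranged differently), and all helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def cholesky_lower(omega):
    """Lower triangular factor T with T T' = omega (assumed convention)."""
    t11 = math.sqrt(omega[0][0])
    t21 = omega[0][1] / t11
    t22 = math.sqrt(omega[1][1] - t21 ** 2)
    return [[t11, 0.0], [t21, t22]]

def close(a, b, tol=1e-12):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))

omega3 = [[2.0, 1.0], [1.0, 2.0]]
r3 = math.sqrt(3.0)
p3 = [[(r3 + 1) / 2, (r3 - 1) / 2], [(r3 - 1) / 2, (r3 + 1) / 2]]   # symmetric root
t3 = cholesky_lower(omega3)                                          # triangular root

print(close(matmul(p3, transpose(p3)), omega3), close(matmul(t3, transpose(t3)), omega3))
```

Both factors reproduce $\Omega_3$ although they are different matrices, one symmetric and one triangular.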


Functional  Limit  Theory 

Phillips (1986) and Phillips and Durlauf (1986) introduced, under appropriate
assumptions, multivariate generalizations of Proposition 14.1 into econometrics:

\[ n^{-0.5}\, z_{\lfloor sn \rfloor} = n^{-0.5} \sum_{t=1}^{\lfloor sn \rfloor} w_t
   \Rightarrow \Omega_w^{0.5}\, W(s) = \begin{pmatrix} B_1(s) \\ B_2(s) \end{pmatrix}. \tag{14.8} \]

For the individual components this means:

\[ n^{-0.5}\, z_{1, \lfloor sn \rfloor} \Rightarrow B_1(s) \quad \text{and} \quad n^{-0.5}\, z_{2, \lfloor sn \rfloor} \Rightarrow B_2(s), \]

where the Brownian motions are generally not independent of each other. Independence
is only present if $\Omega_w$ is diagonal ($\omega_{12} = 0$) as it then holds:

\[ \begin{pmatrix} B_1(t) \\ B_2(t) \end{pmatrix} = \begin{pmatrix} \omega_1 W_1(t) \\ \omega_2 W_2(t) \end{pmatrix}. \]

Under adequate technical conditions, which need not be specified here, the
following proposition holds, cf. as well Johansen (1995, Theorem B.13).


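Proposition 14.4 (I(1) Asymptotics) Let $\{z_t\}$ be a 2-dimensional integrated process
and $\Delta z_t = z_t - z_{t-1} = w_t$ with $E(w_t) = (0, 0)'$ and $\Omega_w$ from (14.6). Then it
holds under some additional assumptions that

(a) \[ n^{-1.5} \sum_{t=1}^{n} z_t \xrightarrow{d} \Omega_w^{0.5} \int_0^1 W(s)\,ds, \]

(b) \[ n^{-2} \sum_{t=1}^{n} z_t z_t' \xrightarrow{d} \Omega_w^{0.5} \int_0^1 W(s) W'(s)\,ds\ (\Omega_w^{0.5})', \]

(c) \[ n^{-1} \sum_{t=1}^{n} z_t w_t' \xrightarrow{d} \Omega_w^{0.5} \int_0^1 W(s)\,dW'(s)\ (\Omega_w^{0.5})' + \sum_{h=0}^{\infty} \Gamma_w(h), \]

as $n \to \infty$.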


Naturally, these results can be expressed in terms of $B = \Omega_w^{0.5} W$ as well:

\[ \Omega_w^{0.5} \int_0^1 W(s)\,ds = \int_0^1 B(s)\,ds, \]

\[ \Omega_w^{0.5} \int_0^1 W(s) W'(s)\,ds\ (\Omega_w^{0.5})' = \int_0^1 B(s) B'(s)\,ds, \]

\[ \Omega_w^{0.5} \int_0^1 W(s)\,dW'(s)\ (\Omega_w^{0.5})' = \int_0^1 B(s)\,dB'(s). \]

The limit from Proposition 14.4 (a) is to be read as a vector of Riemann integrals,

\[ \int_0^1 W(s)\,ds = \begin{pmatrix} \int_0^1 W_1(s)\,ds \\ \int_0^1 W_2(s)\,ds \end{pmatrix}. \]

In (b), we have a square matrix:

\[ \int_0^1 W(s) W'(s)\,ds =
   \begin{pmatrix} \int_0^1 W_1^2(s)\,ds & \int_0^1 W_1(s) W_2(s)\,ds \\
                   \int_0^1 W_2(s) W_1(s)\,ds & \int_0^1 W_2^2(s)\,ds \end{pmatrix}. \]

Concluding, both these outcomes are results from (14.8) and from a multivariate
version of the continuous mapping theorem, cf. Proposition 14.3. The third result
from Proposition 14.4, the matrix of Ito integrals, corresponds to result (f) from
Proposition 14.2, also cf. Corollary 14.1:

\[ \int_0^1 W(s)\,dW'(s) =
   \begin{pmatrix} \int_0^1 W_1(s)\,dW_1(s) & \int_0^1 W_1(s)\,dW_2(s) \\
                   \int_0^1 W_2(s)\,dW_1(s) & \int_0^1 W_2(s)\,dW_2(s) \end{pmatrix}. \]

In  the  multivariate  setting,  such  a  convergence  cannot  be  elementarily  derived  any 
longer.  For  a  proof  see  e.g.  Phillips  (1988)  or  Hansen  (1992a). 


14.5 Problems and Solutions

Problems 

14.1 Derive with elementary means the expression (14.4) for the long-run variance
of the MA($\infty$) process $\{e_t\}$ from (14.2).

14.2 Derive the limiting distribution of $n^{-1} \sum_{t=1}^{n} x_t e_t$ from Corollary 14.1.

14.3 Prove Proposition 14.2 (a), (c), (d) and (e). When doing this, you may assume
the functionals to be continuous with respect to an appropriate metric.

14.4 Prove Proposition 14.2 (b).

14.5 Prove Proposition 14.2 (f).

14.6 Check that $d_s(f, g)$ with

\[ d_s(f, g) = \sup_{0 \le t \le 1} |f(t) - g(t)|, \quad f, g \in D[0, 1], \]

is a metric (supremum metric).

14.7 Show that the integral functionals $h_1$, $h_2$ and $h_3$ from Example 14.4 are
continuous on $D[0, 1]$ with respect to the supremum metric.


Solutions 


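14.1 First we adopt the autocovariance function from Proposition 3.2:

\[ \gamma_e(h) = \sigma^2 \sum_{j=0}^{\infty} c_j c_{j+h}. \]

In (14.4), the long-run variance is formulated as follows:

\[ \omega_e^2 = \sigma^2 \Big( \sum_{j=0}^{\infty} c_j \Big)^2. \]

As the coefficients $\{c_j\}$ are absolutely summable, the infinite sums can be multiplied
out:

\begin{align*}
\frac{\omega_e^2}{\sigma^2} &= c_0 c_0 + c_0 c_1 + c_0 c_2 + \ldots \\
  &\quad + c_1 c_0 + c_1 c_1 + c_1 c_2 + \ldots \\
  &\quad + c_2 c_0 + c_2 c_1 + c_2 c_2 + \ldots \\
  &\quad + \ldots \\
  &= \sum_{j=0}^{\infty} c_j^2 + 2 \sum_{j=0}^{\infty} c_j c_{j+1} + 2 \sum_{j=0}^{\infty} c_j c_{j+2} + \ldots \\
  &= \frac{1}{\sigma^2} \left( \gamma_e(0) + 2 \gamma_e(1) + 2 \gamma_e(2) + \ldots \right).
\end{align*}

Hence, the equivalence of the representations of the long-run variance from (14.1)
and (14.4) is derived.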
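For a finite moving average the equivalence of the two representations can be checked by direct computation. The following sketch is only an illustration: the coefficient list `c`, the innovation variance `sigma2` and the function names are hypothetical.

```python
def autocovariance(c, sigma2, h):
    """gamma_e(h) = sigma^2 * sum_j c_j c_{j+h} for a finite MA with coefficients c."""
    return sigma2 * sum(c[j] * c[j + h] for j in range(len(c) - h))

def long_run_variance_from_coefficients(c, sigma2):
    """Representation (14.4): sigma^2 times the squared sum of the coefficients."""
    return sigma2 * sum(c) ** 2

def long_run_variance_from_autocovariances(c, sigma2):
    """Representation (14.1): gamma_e(0) plus twice the sum of all higher autocovariances."""
    return autocovariance(c, sigma2, 0) + 2 * sum(
        autocovariance(c, sigma2, h) for h in range(1, len(c))
    )

c = [1.0, 0.5, 0.25]      # hypothetical MA(2) coefficients
sigma2 = 2.0              # hypothetical innovation variance
print(long_run_variance_from_coefficients(c, sigma2),
      long_run_variance_from_autocovariances(c, sigma2))
```

Both functions return the same number, since multiplying out the squared coefficient sum yields exactly the autocovariances counted once for lag zero and twice for every positive lag.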

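14.2 Due to $x_t = x_{t-1} + e_t$ we write

\[ n^{-1} \sum_{t=1}^{n} x_t e_t = n^{-1} \sum_{t=1}^{n} x_{t-1} e_t + n^{-1} \sum_{t=1}^{n} e_t^2. \]

The limiting behavior of the first sum on the right-hand side is known from
Proposition 14.2, and the second sum on the right-hand side tends to $\mathrm{Var}(e_t) =
\gamma_e(0)$. Thus, elementary transformations yield

\begin{align*}
n^{-1} \sum_{t=1}^{n} x_t e_t
  &\xrightarrow{d} \frac{\omega_e^2}{2} \left( W^2(1) - \frac{\gamma_e(0)}{\omega_e^2} \right) + \gamma_e(0) \\
  &= \frac{\omega_e^2}{2} \left( W^2(1) + \frac{\gamma_e(0)}{\omega_e^2} \right) \\
  &= \omega_e^2 \left( \frac{W^2(1) - 1}{2} + \frac{\omega_e^2 + \gamma_e(0)}{2\,\omega_e^2} \right).
\end{align*}

The application of Ito’s lemma completes the proof.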

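14.3

(a) The interval [0, 1) is split up into $n$ subintervals of the same length,

\[ [0, 1) = \bigcup_{t=1}^{n} \left[ \frac{t-1}{n}, \frac{t}{n} \right). \]

On each of these subintervals, we define the step function $X_n(s)$ as the
appropriately normalized I(1) process,

\[ X_n(s) = \frac{1}{\sqrt{n}\,\omega_e} \sum_{j=1}^{t-1} e_j = \frac{x_{t-1}}{\sqrt{n}\,\omega_e},
   \quad s \in \left[ \frac{t-1}{n}, \frac{t}{n} \right), \]

and on the endpoint, it holds for $s = 1$

\[ X_n(1) = \frac{1}{\sqrt{n}\,\omega_e} \sum_{j=1}^{n} e_j = \frac{x_n}{\sqrt{n}\,\omega_e}. \]

Due to Proposition 14.1 we have ($n \to \infty$)

\[ X_n(s) \Rightarrow W(s). \]

Furthermore, we take a trick into account which allows for expressing a sum
over $x_{t-1}$ as an integral:

\[ \int_{\frac{t-1}{n}}^{\frac{t}{n}} x_{t-1}\,ds = x_{t-1}\, s \,\Big|_{\frac{t-1}{n}}^{\frac{t}{n}}
   = x_{t-1} \left( \frac{t}{n} - \frac{t-1}{n} \right) = \frac{x_{t-1}}{n}. \]

Equipped with this, one explicitly obtains

\begin{align*}
\frac{n^{-3/2}}{\omega_e} \sum_{t=1}^{n} x_{t-1}
  &= \frac{n^{-1/2}}{\omega_e} \sum_{t=1}^{n} \frac{x_{t-1}}{n} \\
  &= \frac{n^{-1/2}}{\omega_e} \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} x_{t-1}\,ds \\
  &= \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} \frac{x_{t-1}}{\sqrt{n}\,\omega_e}\,ds \\
  &= \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} X_n(s)\,ds \\
  &= \int_0^1 X_n(s)\,ds.
\end{align*}

As we may assume that the functional

\[ h(g) = \int_0^1 g(s)\,ds \]

is continuous with respect to an appropriate metric, Proposition 14.3 yields for
$n \to \infty$:

\[ \int_0^1 X_n(s)\,ds \xrightarrow{d} \int_0^1 W(s)\,ds. \]

Hence, the proof is complete.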
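The exact finite-sample identity between the scaled sum and the integral of the step function can be checked numerically. The sketch below is an illustration; the function names, the innovations `e` and the value of `omega` are hypothetical.

```python
import math

def partial_sums(e):
    """Return [x_0, x_1, ..., x_n] with x_0 = 0 and x_t = x_{t-1} + e_t."""
    x = [0.0]
    for value in e:
        x.append(x[-1] + value)
    return x

def scaled_sum(e, omega):
    """n^{-3/2} / omega times the sum of x_{t-1}, t = 1, ..., n."""
    n, x = len(e), partial_sums(e)
    return n ** -1.5 / omega * sum(x[t - 1] for t in range(1, n + 1))

def integral_of_step_function(e, omega):
    """Integral over [0, 1] of the step function, which equals
    x_{t-1} / (sqrt(n) * omega) on each subinterval of length 1/n."""
    n, x = len(e), partial_sums(e)
    return sum(x[t - 1] / (math.sqrt(n) * omega) * (1.0 / n) for t in range(1, n + 1))

e = [1.0, -2.0, 0.5, 1.5]     # hypothetical innovations
omega = 1.0                   # hypothetical long-run standard deviation
print(scaled_sum(e, omega), integral_of_step_function(e, omega))
```

The two numbers coincide for every sample, so the limit of the scaled sum follows from the limit of the integral alone.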

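(c) Just as for the proof of (a) we define the stochastic step function

\[ X_n(s) = \frac{1}{\sqrt{n}\,\omega_e} \sum_{j=1}^{t-1} e_j = \frac{x_{t-1}}{\sqrt{n}\,\omega_e},
   \quad s \in \left[ \frac{t-1}{n}, \frac{t}{n} \right), \]

and in addition the analogously constructed deterministic step function

\[ T_n(s) = \frac{t}{n} = \frac{\lfloor sn \rfloor + 1}{n},
   \quad s \in \left[ \frac{t-1}{n}, \frac{t}{n} \right), \]

$t = 1, \ldots, n$, and $T_n(1) = 1$. Then it holds that

\begin{align*}
\frac{n^{-5/2}}{\omega_e} \sum_{t=1}^{n} t\, x_{t-1}
  &= \frac{n^{-3/2}}{\omega_e} \sum_{t=1}^{n} \frac{t}{n}\, x_{t-1} \\
  &= \frac{n^{-1/2}}{\omega_e} \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} \frac{t}{n}\, x_{t-1}\,ds \\
  &= \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} \frac{t}{n}\, \frac{x_{t-1}}{\sqrt{n}\,\omega_e}\,ds \\
  &= \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} T_n(s) X_n(s)\,ds \\
  &= \int_0^1 T_n(s) X_n(s)\,ds.
\end{align*}

For $n \to \infty$ it holds

\[ T_n(s) X_n(s) \Rightarrow s\,W(s), \]

and hence, due to the continuity of the integral functional, as claimed

\[ \int_0^1 T_n(s) X_n(s)\,ds \xrightarrow{d} \int_0^1 s\,W(s)\,ds. \]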


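(d) Again, with the definition of $X_n(s)$ it is shown:

\begin{align*}
\frac{n^{-1/2}}{\omega_e} \sum_{t=1}^{\lfloor sn \rfloor} (e_t - \bar{e})
  &= \frac{n^{-1/2}}{\omega_e} \left( x_{\lfloor sn \rfloor} - \lfloor sn \rfloor\, \bar{e} \right) \\
  &= X_n(s) - \frac{\lfloor sn \rfloor}{n} \sum_{t=1}^{n} \frac{e_t}{\sqrt{n}\,\omega_e} \\
  &= X_n(s) - \frac{\lfloor sn \rfloor}{n}\, X_n(1) \\
  &\Rightarrow W(s) - s\,W(1).
\end{align*}

(e) The proof is entirely analogous to (a),

\begin{align*}
\frac{n^{-2}}{\omega_e^2} \sum_{t=1}^{n} x_{t-1}^2
  &= \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} \frac{x_{t-1}^2}{(\sqrt{n}\,\omega_e)^2}\,ds \\
  &= \sum_{t=1}^{n} \int_{\frac{t-1}{n}}^{\frac{t}{n}} X_n^2(s)\,ds \\
  &= \int_0^1 X_n^2(s)\,ds \xrightarrow{d} \int_0^1 W^2(s)\,ds.
\end{align*}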
14.4 The result can be shown in three steps.

(i) By definition it holds:

\begin{align*}
n^{-1} \sum_{t=1}^{n} x_{t-1}
  &= n^{-1} \{ 0 + e_1 + (e_1 + e_2) + \ldots + (e_1 + e_2 + \ldots + e_{n-1}) \} \\
  &= n^{-1} \{ (n-1) e_1 + (n-2) e_2 + \ldots + e_{n-1} \} \\
  &= n^{-1} \sum_{t=1}^{n} (n - t)\, e_t \\
  &= \sum_{t=1}^{n} e_t - n^{-1} \sum_{t=1}^{n} t\, e_t.
\end{align*}

(ii) Thus, the sum of interest is reduced to known quantities:

\[ n^{-1} \sum_{t=1}^{n} t\, e_t = \sum_{t=1}^{n} e_t - n^{-1} \sum_{t=1}^{n} x_{t-1}
   = x_n - n^{-1} \sum_{t=1}^{n} x_{t-1}, \]

or

\[ n^{-3/2} \sum_{t=1}^{n} t\, e_t = n^{-1/2} x_n - n^{-3/2} \sum_{t=1}^{n} x_{t-1}
   \xrightarrow{d} \omega_e W(1) - \omega_e \int_0^1 W(s)\,ds, \]

where Proposition 14.1 for $s = 1$ and Proposition 14.2 (a) were applied.

(iii) From Example 9.1 (a) with $t = 1$ we know:

\[ W(1) - \int_0^1 W(s)\,ds = \int_0^1 s\,dW(s). \]

Hence, the claim is proved.
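Step (i) is a purely algebraic identity and can be checked for any sample. The following sketch does so for hypothetical innovations `e`; the function names are illustrative only.

```python
def lagged_level_sum(e):
    """Sum of x_{t-1} for t = 1, ..., n, where x_t is the partial sum of e."""
    total, level = 0.0, 0.0
    for value in e:
        total += level        # add x_{t-1} before updating to x_t
        level += value
    return total

def weighted_innovation_form(e):
    """n * sum(e_t) - sum(t * e_t), the right-hand side of step (i) multiplied by n."""
    n = len(e)
    return n * sum(e) - sum(t * value for t, value in enumerate(e, start=1))

e = [1.0, -2.0, 0.5, 1.5]     # hypothetical innovations
print(lagged_level_sum(e), weighted_innovation_form(e))
```

Both expressions agree, which is what allows the weighted sum of innovations to be traced back to the level $x_n$ and the sum of lagged levels.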


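14.5 The first result is based on the binomial formula applied to $x_t = x_{t-1} + e_t$:

\[ x_t^2 = x_{t-1}^2 + 2 x_{t-1} e_t + e_t^2. \]

Solving this for the mixed term, we obtain:

\begin{align*}
n^{-1} \sum_{t=1}^{n} x_{t-1} e_t
  &= \frac{n^{-1}}{2} \sum_{t=1}^{n} \left( x_t^2 - x_{t-1}^2 - e_t^2 \right) \\
  &= \frac{n^{-1}}{2} \left( x_n^2 - \sum_{t=1}^{n} e_t^2 \right) \\
  &= \frac{1}{2} \left( \left( \frac{x_n}{\sqrt{n}} \right)^2 - \frac{1}{n} \sum_{t=1}^{n} e_t^2 \right) \\
  &\xrightarrow{d} \frac{1}{2} \left( \omega_e^2 W^2(1) - \gamma_e(0) \right) \\
  &= \frac{\omega_e^2}{2} \left( W^2(1) - \frac{\gamma_e(0)}{\omega_e^2} \right).
\end{align*}

Thus, we come to the second claim. Obviously, it holds that

\[ \frac{\omega_e^2}{2} \left( W^2(1) - \frac{\gamma_e(0)}{\omega_e^2} \right)
   = \frac{\omega_e^2}{2} \left( W^2(1) - 1 + \frac{\omega_e^2 - \gamma_e(0)}{\omega_e^2} \right). \]

The special case (10.3) of Ito’s lemma then establishes the equality claimed.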
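The telescoping step behind the mixed term can also be verified numerically. This is a small sketch with hypothetical innovations `e` and illustrative function names.

```python
def mixed_sum(e):
    """Sum of x_{t-1} * e_t for t = 1, ..., n, with x_0 = 0."""
    total, level = 0.0, 0.0
    for value in e:
        total += level * value
        level += value
    return total

def telescoped_form(e):
    """(x_n^2 - sum of e_t^2) / 2, obtained by summing the binomial formula over t."""
    x_n = sum(e)
    return 0.5 * (x_n ** 2 - sum(value ** 2 for value in e))

e = [1.0, -2.0, 0.5, 1.5]     # hypothetical innovations
print(mixed_sum(e), telescoped_form(e))
```

Since the identity holds exactly for each sample, only the limits of $x_n / \sqrt{n}$ and of the average of the squared innovations are needed for the asymptotics.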

14.6 In order to have a metric, three conditions from the text need to be fulfilled.

(i) The condition $f = g$ means

\[ f(t) = g(t), \quad t \in [0, 1], \]

which is equivalent to

\[ |f(t) - g(t)| = 0, \quad t \in [0, 1]. \]

From this it immediately follows $d_s(f, g) = 0$. Conversely,

\[ \sup_{0 \le t \le 1} |f(t) - g(t)| = 0 \]

immediately implies $|f(t) - g(t)| = 0$. Therefore, two functions $f$ and $g$ in $D[0, 1]$ are
indeed equal if and only if they have zero distance according to the supremum
metric.

(ii) The symmetry condition is obviously given as it holds for the absolute value:

\[ |f(t) - g(t)| = |g(t) - f(t)|. \]

(iii) Finally, the triangle inequality can be established by adding zero:

\begin{align*}
d_s(f, g) &= \sup_t |f(t) - h(t) + h(t) - g(t)| \\
  &\le \sup_t \{ |f(t) - h(t)| + |h(t) - g(t)| \} \\
  &\le \sup_t |f(t) - h(t)| + \sup_t |h(t) - g(t)| \\
  &= d_s(f, h) + d_s(h, g).
\end{align*}

The second, rather plausible inequality follows from the properties of the
supremum, see e.g. Sydsaeter, Strøm, and Berck (1999, p. 77).

14.7 We consider three functionals $h_i(f)$, $i = 1, 2, 3$, which assign a real number
to the function $f(t)$.

(i) The first functional is just the integral from zero to one. Here it holds:

\begin{align*}
|h_1(f) - h_1(g)| &= \left| \int_0^1 f(t)\,dt - \int_0^1 g(t)\,dt \right| \\
  &\le \int_0^1 |f(t) - g(t)|\,dt \\
  &\le \int_0^1 \sup_{0 \le s \le 1} |f(s) - g(s)|\,dt \\
  &= \left( \sup_{0 \le s \le 1} |f(s) - g(s)| \right) \int_0^1 dt \\
  &= \sup_{0 \le s \le 1} |f(s) - g(s)| \\
  &= d_s(f, g).
\end{align*}

Hence, the smaller the deviation of $f$ and $g$, the nearer are $h_1(f)$ and $h_1(g)$. This
exactly corresponds to the definition of continuity.

(ii) The second functional is the integral over a quadratic function. Here we obtain
with the binomial formula and the triangle inequality:

\begin{align*}
|h_2(f) - h_2(g)| &= \left| \int_0^1 f^2(t)\,dt - \int_0^1 g^2(t)\,dt \right| \\
  &= \left| \int_0^1 (g(t) - f(t))^2\,dt - 2 \int_0^1 g(t) (g(t) - f(t))\,dt \right| \\
  &\le \int_0^1 |g(t) - f(t)|^2\,dt + 2 \int_0^1 |g(t)|\, |g(t) - f(t)|\,dt \\
  &\le \left( \sup_{0 \le t \le 1} |g(t) - f(t)| \right)^2 + 2 \sup_{0 \le t \le 1} |g(t) - f(t)| \int_0^1 |g(t)|\,dt.
\end{align*}

For the last inequality, the integrands were bounded by the supremum as in (i) and then
integrated from 0 to 1. Hence, it holds by definition

\[ |h_2(f) - h_2(g)| \le (d_s(g, f))^2 + 2\, d_s(g, f) \int_0^1 |g(t)|\,dt. \]

As $g(t)$ belongs to $D[0, 1]$ and is thus absolutely integrable, $h_2(f)$ tends to $h_2(g)$
if the distance between $f$ and $g$ gets smaller, which amounts to continuity.

(iii) The third functional is $h_3 = 1 / h_2$. Hence, we reduce the continuity of $h_3$ to the one of
$h_2$:

\begin{align*}
|h_3(f) - h_3(g)| &= \left| \frac{1}{\int_0^1 f^2(t)\,dt} - \frac{1}{\int_0^1 g^2(t)\,dt} \right| \\
  &= \frac{\left| \int_0^1 g^2(t)\,dt - \int_0^1 f^2(t)\,dt \right|}{\int_0^1 f^2(t)\,dt \int_0^1 g^2(t)\,dt} \\
  &= \frac{|h_2(g) - h_2(f)|}{\int_0^1 f^2(t)\,dt \int_0^1 g^2(t)\,dt}.
\end{align*}

Hence, from the (quadratic) integrability of $f$ and $g$ and the continuity of $h_2$
follows, as required, the continuity of $h_3$.
15 Trends, Integration Tests and Nonsense Regressions


15.1  Summary 

Now we consider some applications of the propositions from the previous chapter.
In particular, $\{e_t\}$ and $\{x_t\}$ are integrated of order 0 and integrated of order
1, respectively, cf. the definitions above Proposition 14.2. It turns out that the
regression  of  a  time  series  on  a  linear  trend  leads  to  asymptotically  Gaussian 
estimators.  However,  test  statistics  constructed  to  distinguish  between  integration 
of  order  1  and  0  are  not  Gaussian.  Finally,  we  cover  the  problem  of  nonsense 
regressions,  which  occur  particularly  in  the  case  of  independent  integrated  variables. 


15.2 Trend Regressions

Let $\{y_t\}$ be trending in the sense that the expectation follows a linear time trend,
$E(y_t) = \beta t$. The slope parameter is estimated following the least squares (LS)
method. The residuals, $\widehat{res}_t = y_t - \hat{\beta} t$, are then the detrended series. Estimation
relies on a sample of size $n$.


Detrending 

The time series $\{y_t\}$ is regressed on a linear time trend according to the least squares
method. For the sake of simplicity, we neglect a constant intercept that would have
to be included in practice. The LS estimator $\hat{\beta}$ of the regression

\[ y_t = \hat{\beta}\, t + \widehat{res}_t, \quad t = 1, \ldots, n, \tag{15.1} \]

with the empirical residuals $\{\widehat{res}_t\}$ is

\[ \hat{\beta} = \frac{\sum_{t=1}^{n} t\, y_t}{\sum_{t=1}^{n} t^2}. \]

For the denominator the following formula holds

\[ \sum_{t=1}^{n} t^2 = \frac{n (n + 1) (2n + 1)}{6} = \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6}, \]

such that one obtains asymptotically ($n \to \infty$)

\[ \frac{1}{n^3} \sum_{t=1}^{n} t^2 \to \frac{1}{3}. \tag{15.2} \]

This limit is a special case of a more general result dealt with in Problem 15.1.
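The closed form for the sum of squares and the limit in (15.2) are easy to check numerically. This is a minimal sketch with hypothetical function names.

```python
def sum_of_squares(n):
    """Direct evaluation of 1^2 + 2^2 + ... + n^2."""
    return sum(t * t for t in range(1, n + 1))

def closed_form(n):
    """n (n + 1) (2n + 1) / 6, which is always an integer."""
    return n * (n + 1) * (2 * n + 1) // 6

for n in (10, 100, 1000):
    # the ratio approaches 1/3 with an error of order 1/(2n)
    print(n, sum_of_squares(n) == closed_form(n), sum_of_squares(n) / n ** 3)
```

The normalized sum approaches one third from above, the leading error term being $1/(2n)$.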

In this chapter we consider two models with a linear time trend in the mean. First,
if the deviations from the linear trend are stationary, i.e. if the true model reads

\[ y_t = \beta t + e_t, \tag{15.3} \]

then we say that $\{y_t\}$ is trend stationary. More precisely, $\{e_t\}$ satisfies the assumptions
of an I(0) process discussed in the previous chapter with long-run variance $\omega_e^2$
defined in (14.1). Second, the stochastic component may be I(1). Then one says that
$\{y_t\}$ is integrated with drift ($\beta \ne 0$): $\Delta y_t = \beta + e_t$. By integrating (summing up)
with a starting value of zero, this translates into

\[ y_t = \beta t + x_t, \quad t = 1, \ldots, n, \quad x_t = \sum_{j=1}^{t} e_j. \tag{15.4} \]

An example will illustrate the difference between these two trend models.


Example 15.1 (Linear Time Trend) In Fig. 15.1 we see two time series, following
the slope 0.1 on average, $t = 1, 2, \ldots, 250$. The upper graph shows a trend
stationary series,

\[ y_t^{(1)} = 0.1\, t + \varepsilon_t, \]

where $\varepsilon_t \sim ii\mathcal{N}(0, 1)$. The lower diagram shows with identical $\{\varepsilon_t\}$

\[ y_t^{(2)} = 0.1\, t + \sum_{j=1}^{t} \varepsilon_j, \]

i.e. $y_t^{(2)}$ is I(1) with drift:

\[ \Delta y_t^{(2)} = 0.1 + \varepsilon_t. \]

The deviations from the linear time trend are in the lower case I(1) and are hence
much stronger than in the upper case. ■

Fig. 15.1 Linear Time Trend (lower panel: integrated with drift, slope 0.1)
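The two series of the example can be simulated in a few lines. The sketch below is an illustration under assumptions: the seed and the function names `simulate` and `ls_slope` are hypothetical, and the slope is estimated without intercept as in (15.1).

```python
import random

def simulate(n=250, beta=0.1, seed=1):
    """Return innovations, the trend stationary series and the I(1) series with drift."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    trend_stationary, integrated, level = [], [], 0.0
    for t, e in enumerate(eps, start=1):
        level += e                                   # partial sum of the innovations
        trend_stationary.append(beta * t + e)
        integrated.append(beta * t + level)
    return eps, trend_stationary, integrated

def ls_slope(y):
    """LS slope of a regression of y_t on t without intercept."""
    numerator = sum(t * value for t, value in enumerate(y, start=1))
    denominator = sum(t * t for t in range(1, len(y) + 1))
    return numerator / denominator

eps, y_ts, y_i1 = simulate()
print(ls_slope(y_ts), ls_slope(y_i1))
```

In repeated runs with different seeds the slope estimate from the trend stationary series typically stays very close to 0.1, whereas the estimate from the integrated series scatters much more widely.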

Trend  Stationarity 

The  following  proposition  contains  the  properties  of  LS  detrending  under  trend 
stationarity. 

Proposition 15.1 (Trend Stationary) Let $\{y_t\}$ from (15.3) be trend stationary, and
let $\hat{\omega}_e$ denote a consistent estimator for $\omega_e$. It then holds for $\hat{\beta}$ from (15.1) that

\[ n^{1.5}\, \frac{\hat{\beta} - \beta}{\hat{\omega}_e} \xrightarrow{d} \mathcal{N}(0, 3) \tag{15.5} \]

as $n \to \infty$.

A proof is provided in Problem 15.2 relying on Proposition 14.2 (b). In practice, a
consistent estimator $\hat{\omega}_e$ for $\omega_e$ will be built from LS residuals, $\widehat{res}_t = y_t - \hat{\beta}\, t$, see
below for details.

Notice the fast convergence of the estimator $\hat{\beta}$ to its true value (with rate $n^{1.5}$). In
real applications we would typically calculate a trend regression with intercept,

\[ y_t = \tilde{\alpha} + \tilde{\beta}\, t + \widetilde{res}_t, \quad t = 1, \ldots, n. \]

Hassler (2000) showed that limiting normality and the fast convergence rate of the
LS estimator with intercept, $\tilde{\beta}$, persist, although the variance is affected:

\[ n^{1.5}\, \frac{\tilde{\beta} - \beta}{\tilde{\omega}_e} \xrightarrow{d} \mathcal{N}(0, 12); \]

again, this requires that $\tilde{\omega}_e$ constructed from the residuals is consistent.


I(1) with Drift

Now we assume $\{y_t\}$ to be integrated of order 1, possibly with drift. Note, however, that the following proposition does not require $\beta \neq 0$, as we can learn from the proof given in Problem 15.3.

Proposition 15.2 (I(1) with Drift) Let $\{y_t\}$ from (15.4) be I(1), possibly with drift: $\Delta y_t = \beta + e_t$. Let $\hat\omega_e$ denote a consistent estimator for $\omega_e$. It then holds for $\hat\beta$ from (15.1) that

$$n^{0.5}\,\frac{\hat\beta - \beta}{\hat\omega_e} \xrightarrow{d} \mathcal{N}\!\left(0, \tfrac{6}{5}\right) \tag{15.6}$$

as $n \to \infty$.

Consistent estimation of the long-run variance will rely on the differences of the residuals ($\{\Delta res_t\}$). Here, the LS estimator obviously converges much more slowly, namely with the more usual rate $n^{0.5}$ instead of $n^{1.5}$ as in the trend stationary case. This does not come as a surprise given the impression from Fig. 15.1: Since the trend stationary series follows the straight line more closely, the estimation of the slope is more precise than in the I(1) case with drift.

A final word on Proposition 15.2: The same variance as in (15.6) is obtained in the case of a regression with intercept, see Durlauf and Phillips (1988).
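The two convergence rates can be illustrated by a small Monte Carlo experiment. This is a hedged sketch under simplifying assumptions: the shocks are standard normal white noise (so that $\omega_e = 1$), the regression is run without intercept, and all function names and default values (`slope_through_origin`, `scaled_errors`, `variance`, `n=200`, `reps=2000`) are chosen here for illustration only.

```python
import random


def slope_through_origin(y):
    """LS slope from regressing y_t on t without intercept: sum(t * y_t) / sum(t^2)."""
    num = sum(t * v for t, v in enumerate(y, start=1))
    den = sum(t * t for t in range(1, len(y) + 1))
    return num / den


def scaled_errors(n=200, reps=2000, beta=0.1, seed=1):
    """Simulate n^1.5 (b - beta) under trend stationarity and n^0.5 (b - beta) under I(1)."""
    rng = random.Random(seed)
    ts_err, i1_err = [], []
    for _rep in range(reps):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        walk, x = 0.0, []
        for e in eps:
            walk += e
            x.append(walk)                                       # x_t = sum of shocks
        y_ts = [beta * t + e for t, e in enumerate(eps, start=1)]  # trend stationary
        y_i1 = [beta * t + v for t, v in enumerate(x, start=1)]    # I(1) with drift
        ts_err.append(n ** 1.5 * (slope_through_origin(y_ts) - beta))
        i1_err.append(n ** 0.5 * (slope_through_origin(y_i1) - beta))
    return ts_err, i1_err


def variance(values):
    """Plain sample variance (division by the number of values)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    ts_err, i1_err = scaled_errors()
    print("trend stationary, n^1.5 scaling:", variance(ts_err))
    print("I(1) with drift,  n^0.5 scaling:", variance(i1_err))
```

If the experiment behaves as the limit theory suggests, the first variance should be near 3 and the second near 6/5; any single simulation run is of course subject to sampling error.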
Consistent Estimation of the Long-Run Variance

As we have just seen, a consistent estimation of $\omega_e^2$ is frequently needed in practice in order to apply the functional limit theory. For this purpose, $\{e_t\}$ is to be approximated by residuals or differences thereof: In the trend stationary case we have $res_t \approx e_t$, while for I(1) processes it holds $res_t \approx x_t$, such that $\Delta res_t \approx e_t$. The intuition behind an estimator $\hat\omega_e^2$ is readily available from (14.1), $\omega_e^2 = \gamma_e(0) + 2\sum_{h=1}^{\infty}\gamma_e(h)$, although three modifications are required. First, the theoretical autocovariances need to be replaced by the sample analogues,

$$\hat\gamma_e(h) = \frac{1}{n}\sum_{t=1}^{n-h} e_t\,e_{t+h}.$$

Second, the infinite sum has to be cut off as sample autocovariances can be computed only up to the lag $n - 1$. Third, in order to really have a consistent (and positive) estimator, a weight function $w_B(\cdot)$ depending on a tuning parameter $B$ is needed (whose required properties will not be discussed at this point, but see Example 15.2). In the statistics literature, $w_B(\cdot)$ is often called a kernel. Put together, we obtain as sample counterpart to (14.1):

$$\hat\omega_e^2 = \hat\gamma_e(0) + 2\sum_{h=1}^{n-1} w_B(h)\,\hat\gamma_e(h). \tag{15.7}$$

For most kernels $w_B(\cdot)$, the parameter $B$ (the so-called bandwidth) takes over the role of a truncation, i.e. the weights are zero for arguments greater than $B$:

$$\hat\omega_e^2 = \hat\gamma_e(0) + 2\sum_{h=1}^{B} w_B(h)\,\hat\gamma_e(h).$$


In fact, the choice of $B$ is decisive for the quality of an estimation of the long-run variance. On the one hand, $B$ needs to tend to infinity with the sample size; on the other, it needs to diverge more slowly than $n$. Bandwidth selection and the choice of kernels were investigated in pioneering work by Andrews (1991); see also the exposition in Hamilton (1994, Sect. 10.5).

Example 15.2 (Bartlett Weights) According to Maurice S. Bartlett (English statistician, 1910-2002), a very simple weight function has a triangular form:

$$w_B(h) = \begin{cases} 1 - \dfrac{h}{B+1}, & h = 1, 2, \ldots, B+1 \\[1ex] 0, & \text{else.} \end{cases}$$

Hence, the sequence of weights reads for $h = 1, 2, \ldots, B+1$:

$$\frac{B}{B+1},\ \frac{B-1}{B+1},\ \ldots,\ \frac{B+1-h}{B+1},\ \ldots,\ \frac{1}{B+1},\ 0.$$

Plugging this in (15.7), the sum is indeed truncated at $B$:

$$\hat\omega_e^2 = \hat\gamma_e(0) + 2\sum_{h=1}^{B}\frac{B+1-h}{B+1}\,\hat\gamma_e(h).$$

Although the simple Bartlett weights are by no means optimal, they are widespread in econometrics up to the present time, often also named after Newey and West (1987) who popularized them in their paper. ■
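The Bartlett-weighted estimator can be written down in a few lines. The sketch below follows the formulas of this section literally; the function names are assumptions made here for illustration, and the input series is assumed to have mean zero (as LS residuals from a regression with intercept do), since the sample autocovariance above does not subtract a mean.

```python
def sample_autocovariance(e, h):
    """gamma_hat(h) = (1/n) * sum_{t=1}^{n-h} e_t * e_{t+h}; the series is assumed mean zero."""
    n = len(e)
    return sum(e[t] * e[t + h] for t in range(n - h)) / n


def bartlett_weight(h, B):
    """Triangular weight 1 - h/(B+1) for h = 1, ..., B+1 and zero otherwise."""
    return 1.0 - h / (B + 1) if 1 <= h <= B + 1 else 0.0


def long_run_variance(e, B):
    """gamma_hat(0) + 2 * sum_{h=1}^{B} w_B(h) * gamma_hat(h) with Bartlett weights."""
    n = len(e)
    total = sample_autocovariance(e, 0)
    for h in range(1, min(B, n - 1) + 1):   # lags beyond n - 1 cannot be computed
        total += 2.0 * bartlett_weight(h, B) * sample_autocovariance(e, h)
    return total
```

For `B = 0` the estimator reduces to the sample variance $\hat\gamma_e(0)$; negative sample autocorrelation pulls the estimate below it, positive autocorrelation pushes it above.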


15.3 Integration Tests

We  consider  one  test  for  the  null  hypothesis  that  there  is  integration  of  order  1  and 
one  for  the  null  hypothesis  that  there  is  integration  of  order  0. 


Dickey-Fuller  [DF]  Test  for  Nonstationarity 

The oldest and the most frequently applied test on the null hypothesis of integration of order 1 stems from Dickey and Fuller (1979). In the simplest case (without deterministics) the regression model reads

$$x_t = a\,x_{t-1} + e_t, \quad t = 1, \ldots, n,$$

with the null hypothesis

$$H_0: a = 1 \quad \text{(i.e. } x_t \text{ is integrated of order 1)},$$

against the alternative of stationarity, $H_1: |a| < 1$. For the LS estimator $\hat a$ from

$$x_t = \hat a\,x_{t-1} + \hat\varepsilon_t, \quad t = 1, \ldots, n, \tag{15.8}$$

we obtain under $H_0$ the limiting distribution of the normalized LS estimator $n(\hat a - 1)$. Often, one does not work with $n(\hat a - 1)$, but the associated t-statistic:

$$t_a = \frac{\hat a - 1}{s_a} \quad \text{with} \quad s_a^2 = \frac{s^2}{\sum_{t=1}^{n} x_{t-1}^2} = \frac{n^{-1}\sum_{t=1}^{n}\hat\varepsilon_t^2}{\sum_{t=1}^{n} x_{t-1}^2}.$$

1 Many more procedures have been developed over the last decades, notably the test by Elliott et al. (1996) with certain optimality properties.


The  limiting  distributions  from  Proposition  15.3  are  established  in  Problem  15.4. 


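Proposition 15.3 (Dickey-Fuller Test) Let the I(1) process $\{x_t\}$ ($\Delta x_t = e_t$) satisfy the assumptions from Proposition 14.2. It then holds

(a) for $\hat a$ from regression (15.8) without intercept that

$$n(\hat a - 1) \xrightarrow{d} \frac{W^2(1) - \frac{\gamma_e(0)}{\omega_e^2}}{2\int_0^1 W^2(s)\,ds},$$

(b) and for the t-statistic that

$$t_a \xrightarrow{d} \frac{\omega_e}{\sqrt{\gamma_e(0)}}\;\frac{\int_0^1 W(s)\,dW(s) + \frac{\omega_e^2 - \gamma_e(0)}{2\,\omega_e^2}}{\sqrt{\int_0^1 W^2(s)\,ds}},$$

as $n \to \infty$.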

In this elegant form these limiting distributions were first given by Phillips (1987). Note that the distributions depend on two parameters, $\gamma_e(0)$ and $\omega_e^2$, called “nuisance parameters” in this context. They are a nuisance because we have to somehow deal with them (remove their effect) without being economically interested in their values. Particularly, if $e_t = \varepsilon_t$ is a white noise process, then the first limit simplifies to the so-called Dickey-Fuller distribution, cf. (1.11) and (1.12):

$$n(\hat a - 1) \xrightarrow{d} \frac{W^2(1) - 1}{2\int_0^1 W^2(s)\,ds} = \frac{\int_0^1 W(s)\,dW(s)}{\int_0^1 W^2(s)\,ds}.$$

This expression does not depend on unknown nuisance parameters anymore; hence, quantiles can be simulated and approximated. One rejects for small (too strongly negative) values as the test is one-sided against the alternative of stationarity ($|a| < 1$). Similarly, for $\omega_e^2 = \gamma_e(0)$ the limit of the t-statistic simplifies to a ratio free of nuisance parameters:

$$\mathcal{DF}_t = \frac{\int_0^1 W(s)\,dW(s)}{\sqrt{\int_0^1 W^2(s)\,ds}}. \tag{15.9}$$

The numerator equals that of $\mathcal{DF}_a$ from (1.11).

2 Through numerous works by Peter Phillips the functional central limit theory has found its way into econometrics. This kind of limiting distributions was then celebrated as “non-standard asymptotics”; meanwhile it has of course become standard.
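Since the Dickey-Fuller limits are free of nuisance parameters in the white noise case, their quantiles can be approximated by simulation. The following sketch generates random walks under the null hypothesis and computes $n(\hat a - 1)$ and the t-statistic from regression (15.8). Function names, the sample size, the number of replications and the seed are illustrative assumptions; the starting value is set to zero.

```python
import math
import random


def dickey_fuller_statistics(x):
    """Return (n * (a_hat - 1), t-statistic) from regressing x_t on x_{t-1} without intercept.

    x holds x_0, x_1, ..., x_n, with the starting value x_0 in position 0.
    """
    n = len(x) - 1
    sxx = sum(x[t - 1] ** 2 for t in range(1, n + 1))
    sxy = sum(x[t - 1] * x[t] for t in range(1, n + 1))
    a_hat = sxy / sxx
    s2 = sum((x[t] - a_hat * x[t - 1]) ** 2 for t in range(1, n + 1)) / n  # residual variance
    t_stat = (a_hat - 1.0) / math.sqrt(s2 / sxx)
    return n * (a_hat - 1.0), t_stat


def simulate_null(n=100, reps=2000, seed=42):
    """Simulate both statistics for Gaussian random walks (the null hypothesis a = 1)."""
    rng = random.Random(seed)
    coef_stats, t_stats = [], []
    for _rep in range(reps):
        x = [0.0]
        for _step in range(n):
            x.append(x[-1] + rng.gauss(0.0, 1.0))
        coef, t_stat = dickey_fuller_statistics(x)
        coef_stats.append(coef)
        t_stats.append(t_stat)
    return coef_stats, t_stats


def empirical_quantile(values, p):
    """Simple order-statistic quantile, sufficient for a rough approximation."""
    ordered = sorted(values)
    return ordered[int(p * (len(ordered) - 1))]


if __name__ == "__main__":
    coef_stats, t_stats = simulate_null()
    print("5% quantile of n(a_hat - 1):", empirical_quantile(coef_stats, 0.05))
    print("5% quantile of t-statistic: ", empirical_quantile(t_stats, 0.05))
```

The simulated distributions are skewed to the left, which is why the test rejects only for strongly negative values; tabulated critical values in the literature are based on much larger simulation designs than this sketch.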

In the relevant case that $\{e_t\}$ is serially correlated, one can take two routes in practice. Firstly, the test statistics can be appropriately modified by estimators for $\omega_e^2$ and $\gamma_e(0)$. Phillips (1987) and Phillips and Perron (1988) paved this way. Secondly, one frequently calculates the regression augmented by $K$ lags (ADF test):

$$x_t = \hat a\,x_{t-1} + \sum_{k=1}^{K}\hat\alpha_k\,\Delta x_{t-k} + \hat\varepsilon_t, \quad t = K+1, \ldots, n,$$

or with $\phi = a - 1$

$$\Delta x_t = \hat\phi\,x_{t-1} + \sum_{k=1}^{K}\hat\alpha_k\,\Delta x_{t-k} + \hat\varepsilon_t, \quad t = K+1, \ldots, n.$$

If $K$ is so large that the error term is free of serial correlation, then the t-statistic belonging to the test on $a = 1$, i.e. $\phi = 0$, converges to the Dickey-Fuller distribution, and available tabulated percentiles serve as critical values; for further details see Said and Dickey (1984) and Chang and Park (2002). In practice, one would run a regression with intercept. This leaves the functional shape of the limiting distributions unaffected; only replace the WP $W$ by a so-called demeaned WP, see also Problem 15.5.


KPSS  Test  for  Stationarity 

Now, the null and the alternative hypotheses are interchanged. The null hypothesis of the test suggested by Kwiatkowski, Phillips, Schmidt, and Shin (1992) claims that the time series $\{y_t\}$ is integrated of order 0 while it exhibits a random walk component under the alternative (hence, it is I(1)). Actually, this is a test for parameter constancy. The model reads

$$y_t = c_t + e_t, \quad t = 1, \ldots, n,$$

with the hypotheses

$$H_0: c_t = c = \text{constant},$$

$$H_1: c_t \text{ is a random walk}.$$

Under the null hypothesis, the intercept is again estimated by LS:

$$y_t = \hat c + \hat e_t, \quad \hat c = \bar y,$$

$$\hat e_t = y_t - \bar y = e_t - \bar e.$$

From this, the partial sum process $\{S_t\}$ is obtained:

$$S_t := \sum_{j=1}^{t}\hat e_j.$$

Again, we assume that the long-run variance is consistently estimated from the residuals $\hat e_t$ under $H_0$. Then, the test statistic is formulated as

$$\eta = \frac{n^{-2}\sum_{t=1}^{n} S_t^2}{\hat\omega_e^2}.$$

The limiting distribution under the null hypothesis of stationarity is provided in Problem 15.6.


Proposition 15.4 (KPSS Test) Let the I(0) process $\{e_t\}$ satisfy the assumptions from Proposition 14.2, and $y_t = c + e_t$. It then holds that

$$\eta \xrightarrow{d} \int_0^1 \big(W(s) - s\,W(1)\big)^2\,ds =: \mathcal{CM},$$

as $n \to \infty$.

This expression does not depend on unknown nuisance parameters, and critical values are tabulated. In econometrics one often speaks about the KPSS distribution although this distribution has a long tradition in statistics where it also goes under the name of the Cramér-von-Mises (CM) distribution. Quantiles were first tabulated by Anderson and Darling (1952). The limit $\mathcal{CM}$ is constructed from a Brownian bridge with $W(1) - 1 \cdot W(1) = 0$, which reflects of course that $S_n = 0$ by construction.
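The KPSS statistic for the level case can be computed directly from demeaned data, partial sums and a long-run variance estimate. The sketch below uses Bartlett weights for the latter; the function name `kpss_statistic` and the way the bandwidth `B` is passed are assumptions made for illustration, and no critical values are built in.

```python
def kpss_statistic(y, B):
    """Level KPSS statistic: n^{-2} * sum_t S_t^2 divided by a Bartlett long-run variance."""
    n = len(y)
    mean = sum(y) / n
    resid = [v - mean for v in y]            # e_hat_t = y_t - y_bar

    partial, sum_sq = 0.0, 0.0
    for r in resid:
        partial += r                         # S_t = sum_{j<=t} e_hat_j
        sum_sq += partial ** 2

    def gamma(h):
        """Sample autocovariance of the residuals at lag h."""
        return sum(resid[t] * resid[t + h] for t in range(n - h)) / n

    omega2 = gamma(0) + 2.0 * sum(
        (1.0 - h / (B + 1)) * gamma(h) for h in range(1, min(B, n - 1) + 1)
    )
    return sum_sq / (n * n * omega2)


if __name__ == "__main__":
    # A purely trending series is far from level stationarity, so the statistic is large.
    print(kpss_statistic([float(t) for t in range(1, 101)], B=4))
```

Large values speak against the null hypothesis of stationarity, so the test rejects in the upper tail of the CM distribution.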


Linear  Time  Trends 

Many economic and financial time series are driven by a linear time trend in the mean. Consider a trend stationary process $x_t = \beta t + e_t$, which is not integrated of order 1. Still, it holds that

$$x_t = x_{t-1} + \beta + e_t - e_{t-1}.$$

Hence, it is not surprising that a regression of $x_t$ on $x_{t-1}$ results in a Dickey-Fuller statistic not rejecting the false null hypothesis of integration of order 1. In order to avoid the confusion of a stochastic trend (unit root process integrated of order 1) and a linear time trend, one has to include time as explanatory variable in the lag-augmented regression estimated by LS (we also add now a constant intercept):

$$x_t = \hat c + \hat\delta\,t + \hat a\,x_{t-1} + \sum_{k=1}^{K}\hat\alpha_k\,\Delta x_{t-k} + \hat\varepsilon_t, \quad t = K+1, \ldots, n.$$

Under the null hypothesis that $\{x_t\}$ is integrated of order 1 (possibly with drift), the t-statistic associated with $\hat a - 1$ obeys the following limiting distribution:


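$$\widetilde{\mathcal{DF}}_t = \frac{\int_0^1 \widetilde W(s)\,dW(s)}{\sqrt{\int_0^1 \widetilde W^2(s)\,ds}}. \tag{15.10}$$

The functional form is identical to that from (15.9), only that the WP is replaced by a so-called detrended WP $\widetilde W$ defined for instance in Park and Phillips (1988, p. 474):

$$\widetilde W(t) = W(t) - \int_0^1 W(s)\,ds + 12\left(\int_0^1 s\,W(s)\,ds - \frac{1}{2}\int_0^1 W(s)\,ds\right)\left(\frac{1}{2} - t\right).$$

We call $\widetilde{\mathcal{DF}}_t$ also the detrended Dickey-Fuller distribution; critical values are tabulated in the literature.

Similarly, the KPSS test may be modified to account for a linear time trend. Simply replace the demeaned series $\hat e_t = y_t - \bar y$ by the detrended one:

$$\tilde e_t = y_t - \tilde c - \tilde\beta\,t, \qquad S_t = \sum_{j=1}^{t}\tilde e_j.$$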


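Computing the KPSS statistic $\tilde\eta$ from $\tilde e_t$ results asymptotically in a detrended Cramér-von-Mises distribution ($\widetilde{\mathcal{CM}}$ say) as long as the null hypothesis of (trend) stationarity holds true. Details on $\widetilde{\mathcal{CM}}$ and critical values thereof are given in Kwiatkowski et al. (1992): $\tilde\eta \xrightarrow{d} \widetilde{\mathcal{CM}}$ with

$$\widetilde{\mathcal{CM}} = \int_0^1\left[W(s) + (2s - 3s^2)\,W(1) + (6s^2 - 6s)\int_0^1 W(r)\,dr\right]^2 ds. \tag{15.11}$$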


3  Equivalently,  one  might  feed  detrended  data  into  the  ADF  regression  above. 
Nonsense or spurious regressions occur if two integrated processes are regressed
on  each  other  without  being  cointegrated.  Hence,  we  need  to  start  by  defining 
cointegration  and  by  briefly  recapping  the  standard  statistics  from  the  regression 
model. 


Cointegration 

The  starting  point  for  the  econometric  analysis  of  integrated  time  series  is  the 
concept  of  cointegration,  which  is  also  rooted  in  the  equilibrium  paradigm  of 
economic  theory.  The  idea  of  cointegration  was  introduced  by  Granger  (1981)  and 
has  firmly  been  embedded  in  econometrics  by  the  work  of  Engle  and  Granger 
(1987). 

Let us consider two processes $\{x_t\}$ and $\{y_t\}$ integrated of order 1. Sometimes we assume that there is a linear combination with $b \neq 0$ such that

$$y_t - b\,x_t =: v_t \tag{15.12}$$

is integrated of order 0. Here, $y = bx$ is interpreted as a long-run equilibrium relation, as postulated by economic theory, from which, however, the empirical observations deviate at a given point in time $t$ by $v_t$. If there is no linear combination of two I(1) processes that is stationary, then $\{x_t\}$ and $\{y_t\}$ are called not cointegrated.


Estimators  and  Statistics  in  the  Regression  Model 

We consider the regression model without intercept (for the sake of simplicity) that is estimated by means of the least squares (LS) method:

$$y_t = \hat\beta\,x_t + \hat u_t, \quad t = 1, \ldots, n. \tag{15.13}$$

In this section we work under the assumption that $\{x_t\}$ and $\{y_t\}$ are integrated of order 1 but not cointegrated. Hence, each linear combination, $u_t = y_t - \beta x_t$, is necessarily I(1) as well. Let $\{y_t\}$ and $\{x_t\}$ both be components of $\{z_t\}$, $z_t = (y_t, x_t)'$. Hence, it holds with (14.8):

$$n^{-0.5}\,y_{\lfloor sn\rfloor} \Rightarrow B_1(s) \quad \text{and} \quad n^{-0.5}\,x_{\lfloor sn\rfloor} \Rightarrow B_2(s).$$



If the regressand $y_t$ and the regressor $x_t$ are stochastically independent, then this property is transferred to the limiting processes $B_1$ and $B_2$. Nevertheless, we will show that then $\hat\beta$ from (15.13) does not tend to the true value that is zero. Instead, a (significant) relation between the independent variables is spuriously obtained. Since Granger and Newbold (1974), this circumstance is called spurious or nonsense regression.

The LS estimator from the regression without intercept reads

$$\hat\beta = \frac{\sum_{t=1}^{n} x_t\,y_t}{\sum_{t=1}^{n} x_t^2}.$$

The t-statistic⁴ belonging to the test on the parameter value 0 is based on the difference of estimator and hypothetical value divided by the estimated standard error of the estimator:

$$t_\beta = \frac{\hat\beta - 0}{s_\beta} \quad \text{with} \quad s_\beta^2 = \frac{s^2}{\sum_{t=1}^{n} x_t^2}, \quad s^2 = \frac{1}{n}\sum_{t=1}^{n}\hat u_t^2.$$

As a measure of fit, the (uncentered) coefficient of determination is frequently calculated,⁵

$$R_{uc}^2 = 1 - \frac{\sum_{t=1}^{n}\hat u_t^2}{\sum_{t=1}^{n} y_t^2}.$$

Finally, the Durbin-Watson statistic is a well-established measure for the first order residual autocorrelation,

$$dw = \frac{\sum_{t=2}^{n}(\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^{n}\hat u_t^2}.$$

Let us briefly recall the behavior of these measures if we worked with I(0) variables $x$ and $y$ not being correlated ($\beta = 0$). Then, $\hat\beta$ would tend to 0, the t-statistic would converge to a normal distribution and the coefficient of determination would tend to zero. Finally, the Durbin-Watson statistic would converge to $2(1 - \rho_1) > 0$, where $\rho_1$ denotes the first order autocorrelation coefficient of the regression errors. In the case of nonsense regressions, we obtain qualitatively entirely different asymptotic results.


4 For the following calculation of $s^2$ we divide by $n$ without correcting for degrees of freedom, which does not matter asymptotically ($n \to \infty$).

5  “Uncentered”,  as  the  regression  is  calculated  without  intercept. 
Asymptotics

Due to Proposition 14.4 (b) it holds for the denominator of the LS estimator

$$n^{-2}\sum_{t=1}^{n} x_t^2 \xrightarrow{d} \int_0^1 B_2^2(s)\,ds,$$

and for the numerator we obtain

$$n^{-2}\sum_{t=1}^{n} x_t\,y_t \xrightarrow{d} \int_0^1 B_2(s)\,B_1(s)\,ds.$$

Both results put together yield

$$\hat\beta \xrightarrow{d} \frac{\int_0^1 B_1(s)\,B_2(s)\,ds}{\int_0^1 B_2^2(s)\,ds} =: \beta_\infty.$$

In particular, if $y_t$ and $x_t$ are stochastically independent, then $\hat\beta$ does not tend to the true value 0 but to the random variable $\beta_\infty$. And as if that was not enough, the t-statistic belonging to the test on the true parameter value $\beta = 0$ tends to infinity in absolute value! Hence, in this situation t-statistics highly significantly reject the true null hypothesis of no correlation and therefore report absurd relations as being significant. This phenomenon was experimentally discovered for small samples by Granger and Newbold (1974) and asymptotically proved by Phillips (1986). For $n \to \infty$ it namely holds that $n^{-0.5}\,t_\beta$ has a well-defined limiting distribution. In the problem section we prove in addition the further properties of the following proposition.


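Proposition 15.5 (Nonsense Regression) For I(1) processes $\{x_t\}$ and $\{y_t\}$ it holds in case of no cointegration with the notation introduced that

(a) $\hat\beta \xrightarrow{d} \dfrac{\int_0^1 B_1(s)\,B_2(s)\,ds}{\int_0^1 B_2^2(s)\,ds} =: \beta_\infty$,

(b) $R_{uc}^2 \xrightarrow{d} \beta_\infty^2\,\dfrac{\int_0^1 B_2^2(s)\,ds}{\int_0^1 B_1^2(s)\,ds} =: R_\infty^2$,

(c) $n^{-1}\,s^2 \xrightarrow{d} \left(1 - R_\infty^2\right)\int_0^1 B_1^2(s)\,ds =: s_\infty^2$,

(d) $n^{-0.5}\,t_\beta \xrightarrow{d} \dfrac{\beta_\infty}{s_\infty}\sqrt{\int_0^1 B_2^2(s)\,ds}$,

(e) $dw \xrightarrow{p} 0$,

as $n \to \infty$.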



As has already been emphasized, the results (a), (b) and (d) justify speaking of a nonsense regression. A first hint at the lack of cointegration is obtained from the first order residual autocorrelation: For nonsense or spurious regressions, the Durbin-Watson statistic tends to zero.
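The phenomenon is easy to reproduce by simulation. The following sketch regresses one Gaussian random walk on another, independent one, without intercept, and records how often the t-statistic exceeds 1.96 in absolute value as well as the average Durbin-Watson statistic. All names, the threshold 1.96 (the usual two-sided 5 % normal critical value), and the simulation design are assumptions made for illustration.

```python
import math
import random


def random_walk(n, rng):
    """Gaussian random walk of length n starting from the first shock."""
    x, level = [], 0.0
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        x.append(level)
    return x


def regression_without_intercept(y, x):
    """Return (beta_hat, t-statistic, uncentered R^2, Durbin-Watson) for y_t = beta x_t + u_t."""
    n = len(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    u = [b - beta * a for a, b in zip(x, y)]         # LS residuals
    ssr = sum(v * v for v in u)
    s2 = ssr / n                                     # division by n, as in the text
    t_stat = beta / math.sqrt(s2 / sxx)
    r2 = 1.0 - ssr / syy
    dw = sum((u[t] - u[t - 1]) ** 2 for t in range(1, n)) / ssr
    return beta, t_stat, r2, dw


def spurious_experiment(n=200, reps=500, seed=7):
    """Share of |t| > 1.96 and mean Durbin-Watson statistic for independent random walks."""
    rng = random.Random(seed)
    rejections, dw_sum = 0, 0.0
    for _rep in range(reps):
        y, x = random_walk(n, rng), random_walk(n, rng)
        _beta, t_stat, _r2, dw = regression_without_intercept(y, x)
        if abs(t_stat) > 1.96:
            rejections += 1
        dw_sum += dw
    return rejections / reps, dw_sum / reps


if __name__ == "__main__":
    rate, mean_dw = spurious_experiment()
    print("share of 'significant' t-statistics:", rate)
    print("mean Durbin-Watson statistic:       ", mean_dw)
```

In line with parts (d) and (e) of the proposition, one would expect the rejection share to be far above the nominal 5 % and the Durbin-Watson statistic to be close to zero rather than close to 2.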

Example 15.3 (Hendry, 1980) Hendry (1980) shows the real danger of nonsense regressions. In his polemic example, the price development (measured by the consumer price index P) is to be explained. For this purpose, first a money supply variable M is used. Then a second variable C is considered (for which we could think of consumption), and it appears that this time series explains the price development better than M does. However, there can in fact be no talk of explanation; this is a nonsense correlation, as behind C the cumulated rainfall is hidden (hence, the precipitation amount)! By the way, note that P and C are not (only) integrated of order 1 but in addition exhibit a deterministic time trend, which only aggravates the problem of nonsense regression. ■

If integrated variables are not cointegrated, they have to be analyzed in differences, i.e. $\Delta y_t$ is regressed on $\Delta x_t$, resulting in the familiar stationary regression model. Naturally, not all integrated economic variables lead to nonsense regressions. This does not happen under cointegration, which brings us to the final chapter.

15.5 Problems and Solutions

Problems 

15.1 Show

$$n^{-(k+1)}\sum_{t=1}^{n} t^k \to \frac{1}{k+1} \quad \text{for } k \in \mathbb{N}$$

as $n \to \infty$.


15.2  Prove  Proposition  15.1. 


15.3  Prove  Proposition  15.2. 


15.4  Prove  Proposition  15.3. 


15.5 Derive an expression for the limiting distribution of $n(\hat a - 1)$ under $a = 1$ if a model with intercept is estimated in the Dickey-Fuller test:

$$x_t = \hat c + \hat a\,x_{t-1} + \hat\varepsilon_t, \quad t = 1, \ldots, n.$$

Assume $\Delta x_t = \varepsilon_t$ to be white noise.

15.6 Prove Proposition 15.4.

15.7  Prove  the  statements  (b),  (c)  and  (d)  from  Proposition  15.5. 

15.8  Prove  statement  (e)  from  Proposition  15.5. 

Solutions 

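15.1 Define the continuous and hence Riemann-integrable function $f$ on $[0, 1]$ with antiderivative $F$:

$$f(x) = x^k \quad \text{with} \quad F(x) = \frac{x^{k+1}}{k+1}.$$

Further, we work with the equidistant partition

$$[0, 1] = \bigcup_{i=1}^{n}\left[\frac{i-1}{n}, \frac{i}{n}\right].$$

Hence,

$$n^{-(k+1)}\sum_{t=1}^{n} t^k = \frac{1}{n}\sum_{t=1}^{n} f\!\left(\frac{t}{n}\right) \to \int_0^1 f(x)\,dx = F(1) - F(0) = \frac{1}{k+1},$$

which proves the claim.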
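The Riemann-sum argument can be checked numerically. A minimal sketch, in which the function name `normalized_power_sum` is an assumption made here:

```python
def normalized_power_sum(n, k):
    """n^{-(k+1)} * sum_{t=1}^{n} t^k, written as a Riemann sum of x^k on [0, 1]."""
    return sum((t / n) ** k for t in range(1, n + 1)) / n


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, normalized_power_sum(10_000, k), 1.0 / (k + 1))
```

For $k = 1$ the sum equals $(n+1)/(2n)$ exactly, so the gap to the limit $1/2$ is $1/(2n)$; for other $k$ the approximation error is of the same order.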

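15.2 According to the hints in the text, it holds

$$\hat\beta = \frac{\sum_{t=1}^{n} t\,y_t}{\sum_{t=1}^{n} t^2} = \beta + \frac{\sum_{t=1}^{n} t\,e_t}{\sum_{t=1}^{n} t^2},$$

or

$$n^{1.5}(\hat\beta - \beta) = \frac{n^{-1.5}\sum_{t=1}^{n} t\,e_t}{n^{-3}\sum_{t=1}^{n} t^2}.$$

The denominator tends to $\frac{1}{3}$ whereas Proposition 14.2 (b) guarantees the following limiting distribution:

$$n^{1.5}(\hat\beta - \beta) \xrightarrow{d} \omega_e\,\frac{\int_0^1 s\,dW(s)}{\frac{1}{3}}.$$

From Example 9.2 it follows that

$$n^{1.5}\,\frac{\hat\beta - \beta}{3\,\omega_e} \xrightarrow{d} \int_0^1 s\,dW(s) \sim \mathcal{N}\!\left(0, \tfrac{1}{3}\right).$$

If $\omega_e$ is replaced by a consistent estimator,

$$\frac{\hat\omega_e}{\omega_e} \xrightarrow{p} 1,$$

then the result from (15.5) is established.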

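15.3 The I(1) case is treated in an analogous way as the trend stationary case, see Proposition 14.2 (c):

$$n^{0.5}(\hat\beta - \beta) = \frac{n^{-2.5}\sum_{t=1}^{n} t\,x_t}{\frac{1}{3} + \frac{1}{2n} + \frac{1}{6n^2}} \xrightarrow{d} \omega_e\,\frac{\int_0^1 s\,W(s)\,ds}{\frac{1}{3}}.$$

From Corollary 8.1 (c) it follows for $c = 0$

$$\int_0^1 s\,W(s)\,ds \sim \mathcal{N}\!\left(0, \tfrac{2}{15}\right).$$

Hence, (15.6) is proved.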
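The variance of the limit can also be verified by hand. The following short derivation is a sketch that assumes only the standard covariance function of the Wiener process, $\mathrm{Cov}(W(s), W(r)) = \min(s, r)$, and uses that $\int_0^1 s\,W(s)\,ds$ is Gaussian with mean zero:

```latex
\begin{align*}
\operatorname{Var}\left(\int_0^1 s\,W(s)\,ds\right)
  &= \int_0^1\!\!\int_0^1 s\,r\,\min(s,r)\,dr\,ds
   = 2\int_0^1 s\left(\int_0^s r^2\,dr\right)ds \\
  &= 2\int_0^1 \frac{s^4}{3}\,ds = \frac{2}{15},
  \qquad\text{so that}\qquad
  \operatorname{Var}\left(3\int_0^1 s\,W(s)\,ds\right) = 9\cdot\frac{2}{15} = \frac{6}{5}.
\end{align*}
```

After division by $\omega_e$, the limit of $n^{0.5}(\hat\beta - \beta)$ is therefore normal with variance $6/5$.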

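15.4 Under $H_0$, the LS estimator is given as:

$$\hat a = \frac{\sum_{t=1}^{n} x_{t-1}\,x_t}{\sum_{t=1}^{n} x_{t-1}^2} = 1 + \frac{\sum_{t=1}^{n} x_{t-1}\,e_t}{\sum_{t=1}^{n} x_{t-1}^2}.$$

Insofar it holds

$$n(\hat a - 1) = \frac{n^{-1}\sum_{t=1}^{n} x_{t-1}\,e_t}{n^{-2}\sum_{t=1}^{n} x_{t-1}^2}.$$

Thus, numerator as well as denominator are in a form accessible for Proposition 14.2,

$$n(\hat a - 1) \xrightarrow{d} \frac{\frac{\omega_e^2}{2}\left(W^2(1) - \frac{\gamma_e(0)}{\omega_e^2}\right)}{\omega_e^2\int_0^1 W^2(s)\,ds} = \frac{\int_0^1 W(s)\,dW(s) + \frac{\omega_e^2 - \gamma_e(0)}{2\,\omega_e^2}}{\int_0^1 W^2(s)\,ds},$$

which is the first distribution to be established.

To find the second limit, one has to handle the standard error $s_a$ of $\hat a$. It is based on the residual variance estimation

$$\begin{aligned}
s^2 &= \frac{1}{n}\sum_{t=1}^{n}\hat\varepsilon_t^2 = \frac{1}{n}\sum_{t=1}^{n}\left(x_t - \hat a\,x_{t-1}\right)^2 \\
&= \frac{1}{n}\sum_{t=1}^{n}\big((1 - \hat a)\,x_{t-1} + e_t\big)^2 \\
&= \frac{(1 - \hat a)^2}{n}\sum_{t=1}^{n} x_{t-1}^2 + \frac{2(1 - \hat a)}{n}\sum_{t=1}^{n} x_{t-1}\,e_t + \frac{1}{n}\sum_{t=1}^{n} e_t^2 \\
&= n^2(1 - \hat a)^2\,\frac{\sum_{t=1}^{n} x_{t-1}^2}{n^3} + 2n(1 - \hat a)\,\frac{\sum_{t=1}^{n} x_{t-1}\,e_t}{n^2} + \frac{1}{n}\sum_{t=1}^{n} e_t^2 \\
&\xrightarrow{p} 0 + 2\cdot 0 + \gamma_e(0),
\end{aligned}$$

where Proposition 14.2 (e), (f) and $n^{-1}\sum_{t=1}^{n} e_t^2 \xrightarrow{p} \gamma_e(0)$ as well as the circumstance that $n(1 - \hat a)$ converges in distribution were used. All in all, it hence holds for the t-statistic:

$$\begin{aligned}
t_a = \frac{\hat a - 1}{s_a} &= \frac{n(\hat a - 1)\sqrt{n^{-2}\sum_{t=1}^{n} x_{t-1}^2}}{s} \\
&\xrightarrow{d} \frac{\left(\int_0^1 W(s)\,dW(s) + \frac{\omega_e^2 - \gamma_e(0)}{2\,\omega_e^2}\right)\omega_e\sqrt{\int_0^1 W^2(s)\,ds}}{\sqrt{\gamma_e(0)}\int_0^1 W^2(s)\,ds} \\
&= \frac{\omega_e}{\sqrt{\gamma_e(0)}}\;\frac{\int_0^1 W(s)\,dW(s) + \frac{\omega_e^2 - \gamma_e(0)}{2\,\omega_e^2}}{\sqrt{\int_0^1 W^2(s)\,ds}}.
\end{aligned}$$

This completes the proof.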
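15.5 With the means

$$\bar x_{-1} = \frac{1}{n}\sum_{t=1}^{n} x_{t-1}, \qquad \bar\varepsilon = \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t,$$

it holds for the LS estimator when including an intercept under $a = 1$:

$$\begin{aligned}
\hat a &= \frac{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})\,x_t}{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})^2} = \frac{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})(x_{t-1} + \varepsilon_t)}{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})^2} \\
&= \frac{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})^2}{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})^2} + \frac{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})\,\varepsilon_t}{\sum_{t=1}^{n}(x_{t-1} - \bar x_{-1})^2}.
\end{aligned}$$

Thus, we obtain

$$\begin{aligned}
n(\hat a - 1) &= \frac{n^{-1}\left(\sum_{t=1}^{n} x_{t-1}\,\varepsilon_t - \bar x_{-1}\sum_{t=1}^{n}\varepsilon_t\right)}{n^{-2}\left(\sum_{t=1}^{n} x_{t-1}^2 - n\,(\bar x_{-1})^2\right)} \\
&= \frac{n^{-1}\sum_{t=1}^{n} x_{t-1}\,\varepsilon_t - \left(n^{-0.5}\,\bar x_{-1}\right)\left(n^{0.5}\,\bar\varepsilon\right)}{n^{-2}\sum_{t=1}^{n} x_{t-1}^2 - \left(n^{-0.5}\,\bar x_{-1}\right)^2}.
\end{aligned}$$

For $\varepsilon_t \sim WN(0, \sigma^2)$ with $\omega_e^2 = \sigma^2 = \gamma_e(0)$ it therefore holds due to Proposition 14.2 (a), (e) and (f) and according to Proposition 14.1 for $s = 1$:

$$\begin{aligned}
n(\hat a - 1) &\xrightarrow{d} \frac{\sigma^2\int_0^1 W(s)\,dW(s) - \sigma\int_0^1 W(s)\,ds\;\sigma\,W(1)}{\sigma^2\int_0^1 W^2(s)\,ds - \left(\sigma\int_0^1 W(s)\,ds\right)^2} \\
&= \frac{\int_0^1 W(s)\,dW(s) - W(1)\int_0^1 W(s)\,ds}{\int_0^1 W^2(s)\,ds - \left(\int_0^1 W(s)\,ds\right)^2}.
\end{aligned}$$

If one defines the demeaned WP,

$$\overline{W}(s) := W(s) - \int_0^1 W(r)\,dr,$$

then one observes two interesting identities:

$$\begin{aligned}
\int_0^1 \overline{W}(s)\,dW(s) &= \int_0^1 W(s)\,dW(s) - \int_0^1 W(r)\,dr\int_0^1 dW(s) \\
&= \int_0^1 W(s)\,dW(s) - \int_0^1 W(r)\,dr\;W(1),
\end{aligned}$$

and

$$\begin{aligned}
\int_0^1 \left(\overline{W}(s)\right)^2 ds &= \int_0^1\left[W^2(s) - 2\,W(s)\int_0^1 W(r)\,dr + \left(\int_0^1 W(r)\,dr\right)^2\right]ds \\
&= \int_0^1 W^2(s)\,ds - 2\int_0^1 W(s)\,ds\int_0^1 W(r)\,dr + \left(\int_0^1 W(r)\,dr\right)^2 \\
&= \int_0^1 W^2(s)\,ds - \left(\int_0^1 W(r)\,dr\right)^2.
\end{aligned}$$

Hence, the limiting distribution in the case of a regression with intercept (which is equivalent to running a regression of demeaned variables!) can also be written as:

$$n(\hat a - 1) \xrightarrow{d} \frac{\int_0^1 \overline{W}(s)\,dW(s)}{\int_0^1 \left(\overline{W}(s)\right)^2 ds}.$$

Thus, one obtains the same functional form as in the case without intercept, save that $W(s)$ is replaced by $\overline{W}(s)$.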

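15.6 For the partial sum $S_t$ we obtain under the null hypothesis:

$$S_t = \sum_{j=1}^{t}\hat e_j = \sum_{j=1}^{t}(e_j - \bar e).$$

The adequately normalized squared sum yields

$$n^{-2}\sum_{t=1}^{n} S_t^2 = n^{-2}\sum_{t=1}^{n} S_{t-1}^2 + n^{-2}\,S_n^2 = \sum_{t=1}^{n}\int_{(t-1)/n}^{t/n}\left(n^{-0.5}\,S_{\lfloor sn\rfloor}\right)^2 ds + n^{-2}\,S_n^2,$$

where $S_0 = 0$ was used and $S_{\lfloor sn\rfloor}$ is the step function from Proposition 14.2:

$$S_{\lfloor sn\rfloor} = \sum_{j=1}^{\lfloor sn\rfloor}(e_j - \bar e), \quad s \in \left[\frac{t-1}{n}, \frac{t}{n}\right).$$

Note that $S_n$ is zero by construction:

$$S_n = \sum_{j=1}^{n}(e_j - \bar e) = n\,\bar e - n\,\bar e = 0.$$

Hence, we obtain

$$n^{-2}\sum_{t=1}^{n} S_t^2 = \int_0^1\left(n^{-0.5}\,S_{\lfloor sn\rfloor}\right)^2 ds.$$

As there is a continuous functional on the right-hand side, Proposition 14.3 with Proposition 14.2 yields:

$$n^{-2}\sum_{t=1}^{n} S_t^2 \xrightarrow{d} \omega_e^2\int_0^1\big(W(s) - s\,W(1)\big)^2\,ds.$$

If $\omega_e^2$ is replaced by a consistent estimator, then we just obtain the given limiting distribution $\mathcal{CM}$ as a functional of a Brownian bridge. This completes the proof.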

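15.7 The limit $\beta_\infty$ of $\hat\beta$ given in Proposition 15.5 is adopted from the text.

The $s^2$ is based on the LS residuals $\hat u_t = y_t - \hat\beta\,x_t$. Hence, the sum of squared residuals becomes

$$\begin{aligned}
\sum_{t=1}^{n}\hat u_t^2 &= \sum_{t=1}^{n} y_t^2 - 2\,\hat\beta\sum_{t=1}^{n} y_t\,x_t + \hat\beta^2\sum_{t=1}^{n} x_t^2 \\
&= \sum_{t=1}^{n} y_t^2 - 2\,\frac{\left(\sum_{t=1}^{n} x_t\,y_t\right)^2}{\sum_{t=1}^{n} x_t^2} + \frac{\left(\sum_{t=1}^{n} x_t\,y_t\right)^2}{\sum_{t=1}^{n} x_t^2} \\
&= \sum_{t=1}^{n} y_t^2 - \hat\beta^2\sum_{t=1}^{n} x_t^2.
\end{aligned}$$

Thus, again with Proposition 14.4 (b), one immediately obtains for the uncentered coefficient of determination:

$$R_{uc}^2 = 1 - \frac{n^{-2}\sum_{t=1}^{n}\hat u_t^2}{n^{-2}\sum_{t=1}^{n} y_t^2} = \frac{\hat\beta^2\,n^{-2}\sum_{t=1}^{n} x_t^2}{n^{-2}\sum_{t=1}^{n} y_t^2} \xrightarrow{d} \frac{\beta_\infty^2\int_0^1 B_2^2(s)\,ds}{\int_0^1 B_1^2(s)\,ds} =: R_\infty^2.$$

The relation used above between the sum of squared residuals and the coefficient of determination further yields

$$n^{-1}\,s^2 = n^{-2}\sum_{t=1}^{n}\hat u_t^2 = \left(1 - R_{uc}^2\right)n^{-2}\sum_{t=1}^{n} y_t^2 \xrightarrow{d} \left(1 - R_\infty^2\right)\int_0^1 B_1^2(s)\,ds =: s_\infty^2.$$

Moreover, the asymptotics of $s^2$ enables us to state the behavior of the t-statistic as well. The required normalization is obvious:

$$n^{-0.5}\,t_\beta = \frac{\hat\beta\,\sqrt{n^{-2}\sum_{t=1}^{n} x_t^2}}{n^{-0.5}\,s} \xrightarrow{d} \frac{\beta_\infty}{s_\infty}\sqrt{\int_0^1 B_2^2(s)\,ds}.$$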


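15.8 Due to $\hat u_t = y_t - \hat\beta\,x_t$, the numerator of the Durbin-Watson statistic yields:

$$\begin{aligned}
n^{-1}\sum_{t=2}^{n}\left(\hat u_t - \hat u_{t-1}\right)^2 &= n^{-1}\sum_{t=2}^{n}\left(\Delta y_t - \hat\beta\,\Delta x_t\right)^2 \\
&= n^{-1}\sum_{t=2}^{n}\left(w_{1,t} - \hat\beta\,w_{2,t}\right)^2 \\
&= n^{-1}\sum_{t=2}^{n} w_{1,t}^2 - 2\,\hat\beta\,n^{-1}\sum_{t=2}^{n} w_{1,t}\,w_{2,t} + \hat\beta^2\,n^{-1}\sum_{t=2}^{n} w_{2,t}^2 \\
&\xrightarrow{d} \gamma_1(0) - 2\,\beta_\infty\,\mathrm{Cov}(w_{1,t}, w_{2,t}) + \beta_\infty^2\,\gamma_2(0),
\end{aligned}$$

where $\gamma_i(0)$ denote the variances of the processes $\{w_{i,t}\}$, $i = 1, 2$. By definition one obtains

$$n\,dw = \frac{n^{-1}\sum_{t=2}^{n}\left(\hat u_t - \hat u_{t-1}\right)^2}{n^{-2}\sum_{t=1}^{n}\hat u_t^2} \xrightarrow{d} \frac{\gamma_1(0) - 2\,\beta_\infty\,\mathrm{Cov}(w_{1,t}, w_{2,t}) + \beta_\infty^2\,\gamma_2(0)}{s_\infty^2}.$$

Consequently, $dw$ tends to zero in probability.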
Cointegration Analysis


16.1  Summary 

This  chapter  is  addressed  to  the  analysis  of  cointegrated  variables.  Properties  like 
superconsistency  of  the  LS  estimator  and  conditions  for  asymptotic  normality  are 
extensively  discussed.  Error-correction  is  the  reverse  of  cointegration,  which  is 
why  we  provide  an  introduction  to  the  analysis  of  error-correction  models  as  well. 
In  particular,  we  discuss  cointegration  testing.  In  2003,  Clive  W.J.  Granger  was 
awarded  the  Nobel  prize  for  introducing  the  concept  of  cointegration.  Finally,  we 
stress  once  more  the  effect  of  linear  time  trends  underlying  the  series. 


16.2 Error-Correction and Cointegration

Before  cointegration  had  been  launched,  so-called  error-correction  models 
impressed  due  to  their  empirical  performance,  cf.  e.g.  Davidson,  Hendry,  Srba, 
and  Yeo  (1978).  Today  we  know  that  these  models  are  just  the  other  side  of 
the  cointegration  coin.  By  way  of  example,  the  key  statement  of  Granger’s 
representation  theorem  from  Engle  and  Granger  (1987)  is  illustrated,  which 
demonstrates  the  fact  that  error-correction  and  cointegration  are  equivalent. 


Autoregressive  Distributed  Lag  Model 

Let us consider a dynamic regression model in which $\{y_t\}$ has an autoregressive structure on the one hand and which is explained by (lagged) exogenous variables $x_{t-j}$ on the other:

$$y_t = a_1\,y_{t-1} + \cdots + a_p\,y_{t-p} + c_0\,x_t + c_1\,x_{t-1} + \cdots + c_\ell\,x_{t-\ell} + \varepsilon_t.$$


©  Springer  International  Publishing  Switzerland  2016 

U.  Hassler,  Stochastic  Processes  and  Calculus ,  Springer  Texts  in  Business 

and Economics, DOI 10.1007/978-3-319-23428-1_16



Hence, this is an extension of the AR($p$) process. Because of the additional exogenous explanatory variables, we sometimes speak of ARX($p$, $\ell$) models, although such processes are more often called autoregressive distributed lag models, ARDL($p$, $\ell$). We assume that $\{x_t\}$ is integrated of order one. In order to have cointegration with $\{y_t\}$, the ARDL model has to be stable. We adopt the stability condition from Proposition 3.4:

$$1 - a_1 z - \cdots - a_p z^p = 0 \quad \Longrightarrow \quad |z| > 1.$$

Example 16.1 (ARDL(2,2)) With $p = \ell = 2$ we consider

$$y_t = a_1\,y_{t-1} + a_2\,y_{t-2} + c_0\,x_t + c_1\,x_{t-1} + c_2\,x_{t-2} + \varepsilon_t.$$

Due to the assumed stability, the parameter $b$ can be defined:

$$b := \frac{c_0 + c_1 + c_2}{1 - a_1 - a_2}.$$

In fact, the denominator is not only different from zero but positive if stability is given (which we again know from Proposition 3.4): $1 - a_1 - a_2 > 0$. Thus, the following parameter $\gamma$ is negative:

$$\gamma := -(1 - a_1 - a_2) < 0.$$

Elementary manipulations lead to the reparameterization (cf. Problem 16.1)

$$\Delta y_t = \gamma\,[y_{t-1} - b\,x_{t-1}] - a_2\,\Delta y_{t-1} + c_0\,\Delta x_t - c_2\,\Delta x_{t-1} + \varepsilon_t. \tag{16.1}$$

In this equation differences of $y$ are related to their own lags and differences of $x$. In addition, they depend on a linear combination of lagged levels (in square brackets). This last aspect is the one constituting error-correction models. The cointegration relation $y = bx$ is understood as a long-run equilibrium relation and $v_{t-1} = y_{t-1} - b\,x_{t-1}$ as a deviation from it in $t - 1$. This deviation from the equilibrium again influences the increments of $y_t$. The involved linear combination of $y_{t-1}$ and $x_{t-1}$ needs to be stationary because $\{\Delta y_t\}$ is stationary by assumption. Hence, in the example it is obvious that such a relation between differences and levels implies cointegration. Indeed, it is the lagged deviation from the equilibrium $v_{t-1}$ influencing the increments $\Delta y_t$ with a negative sign. If $y_{t-1}$ is greater than the equilibrium value, $v_{t-1} > 0$, then this affects the change from $y_{t-1}$ to $y_t$ negatively, i.e. $y$ is corrected towards the equilibrium, and vice versa for values below the equilibrium value. What economists know as deviation from the equilibrium is called “error” (in the sense of a deviation from a set target) in engineering, which explains the name error-correction model. ■
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The  example  can  be  generalized.  Cointegrated  ARDL  models  of  arbitrary  order  can 
always  be  formulated  as  error-correction  models.  This  is  not  surprising  against  the 
background  of  Granger’s  representation  theorem. 
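
The reparameterization (16.1) of Example 16.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the parameter values, the seed and all variable names are hypothetical choices made for illustration and do not come from the text. It generates data from the ARDL(2,2) equation and compares both sides of (16.1).

```python
import numpy as np

# Hypothetical, stable parameter values chosen for illustration only.
a1, a2 = 0.5, 0.2
c0, c1, c2 = 0.3, 0.4, 0.1

b = (c0 + c1 + c2) / (1.0 - a1 - a2)   # long-run coefficient b
gamma = -(1.0 - a1 - a2)               # adjustment coefficient, negative

rng = np.random.default_rng(0)
n = 200
x = np.cumsum(rng.standard_normal(n))  # I(1) regressor (random walk)
eps = rng.standard_normal(n)

# Generate y from the ARDL(2,2) equation of Example 16.1.
y = np.zeros(n)
for t in range(2, n):
    y[t] = (a1 * y[t - 1] + a2 * y[t - 2]
            + c0 * x[t] + c1 * x[t - 1] + c2 * x[t - 2] + eps[t])

dy = np.diff(y)   # dy[t - 1] equals y[t] - y[t - 1]
dx = np.diff(x)

# Left- and right-hand side of (16.1) for t = 2, ..., n - 1.
t = np.arange(2, n)
lhs = dy[t - 1]
rhs = (gamma * (y[t - 1] - b * x[t - 1])
       - a2 * dy[t - 2] + c0 * dx[t - 1] - c2 * dx[t - 2] + eps[t])

max_gap = np.max(np.abs(lhs - rhs))
print("gamma =", gamma, " b =", b, " max |lhs - rhs| =", max_gap)
```

Up to floating-point rounding, both sides agree for every $t$, which is what the elementary manipulations behind (16.1) assert.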


Granger's  Representation  Theorem 

Error-correction adjustment is the flip side of cointegration. The relation between
cointegration and error-correction is explained by the following proposition, for which
we will, however, not spell out all the technical details. The result goes back to
Granger, cf. Engle and Granger (1987) or Johansen (1995, Theorem 4.2).

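Proposition 16.1 (Representation Theorem) Let $\{y_t\}$ and $\{x_t\}$ be integrated of
order one. They are cointegrated if and only if they have an error-correction
representation,

$$\Delta y_t = \gamma\, v_{t-1} + \sum_{j=1}^{p} a_j\, \Delta y_{t-j} + \sum_{j=1}^{\ell} \alpha_j\, \Delta x_{t-j} + \varepsilon_t , \qquad (16.2)$$

$$\Delta x_t = \gamma_x\, v_{t-1} + \sum_{j=1}^{p_x} a_j^{(x)}\, \Delta y_{t-j} + \sum_{j=1}^{\ell_x} \alpha_j^{(x)}\, \Delta x_{t-j} + \varepsilon_{x,t} , \qquad (16.3)$$

$$v_t = y_t - b\, x_t \sim I(0) , \quad b \neq 0 , \qquad (16.4)$$

where at least one of the so-called adjustment coefficients $\gamma$ or $\gamma_x$ is different from
zero.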


Of course, not all $a_j$ ($a_j^{(x)}$) and $\alpha_j$ ($\alpha_j^{(x)}$) need to be different from zero. The error
sequences $\{\varepsilon_t\}$ and $\{\varepsilon_{x,t}\}$ are white noise and may be contemporaneously correlated.
For this reason, contemporaneous differences of the respective other variable are
frequently added on the right-hand side in practice. For Eq. (16.2) this
means e.g. the inclusion of $\alpha_0 \Delta x_t$. One then sometimes also speaks of the
conditional or structural error-correction equation.


Cointegration  and  the  Long-Run  Variance  Matrix 

In (14.6) the symmetric long-run variance matrix $\Omega$ of a stationary vector has been
defined. We now consider an I(1) vector $z_t' = (z_{1,t}, z_{2,t})$ such that $\Delta z_t' = (w_{1,t}, w_{2,t})$
is integrated of order zero, which is why $\Omega$ cannot be equal to the zero matrix.
Nevertheless, the matrix does not have to be invertible. Rather, it holds: If the
vector $\{z_t\}$ is cointegrated, then $\Omega$ has reduced rank one and is not invertible.
Equivalently, this means: If $\Omega$ is invertible, then $\{z_{1,t}\}$ and $\{z_{2,t}\}$ are not cointegrated.
To illustrate this, we consider the following example.

Example 16.2 ($\Omega$ under Cointegration) In this example, let $\{z_{2,t}\}$ be a random
walk, the cointegration parameter be one, and the deviation from the equilibrium
$v_t = \varepsilon_{1,t}$ be iid and independent of $\{\Delta z_{2,t}\}$:

$$z_{1,t} = z_{2,t} + \varepsilon_{1,t} ,$$

$$z_{2,t} = z_{2,t-1} + \varepsilon_{2,t} .$$

Hence, if $\{\varepsilon_{1,t}\}$ and $\{\varepsilon_{2,t}\}$ are independent with variances $\sigma_1^2$ and $\sigma_2^2$, then one shows
with $w_{1,t} = \Delta z_{1,t} = \varepsilon_{2,t} + \varepsilon_{1,t} - \varepsilon_{1,t-1}$ and $w_{2,t} = \Delta z_{2,t} = \varepsilon_{2,t}$:

$$\Gamma_w(0) = \begin{pmatrix} \sigma_2^2 + 2\sigma_1^2 & \sigma_2^2 \\ \sigma_2^2 & \sigma_2^2 \end{pmatrix} \quad \text{and} \quad \Gamma_w(1) = \begin{pmatrix} -\sigma_1^2 & 0 \\ 0 & 0 \end{pmatrix} .$$

For $h > 1$, $\Gamma_w(h) = 0$. Thus, it holds:

$$\Omega_w = \sigma_2^2 \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} ,$$

i.e. the matrix is of rank one and not invertible.


Conversely, it holds as well that full rank of $\Omega$ follows from the absence of
cointegration. This is illustrated by the following example.


Example 16.3 ($\Omega$ without Cointegration) Now, let $\{z_{1,t}\}$ and $\{z_{2,t}\}$ be two random
walks independent of each other:

$$z_{1,t} = z_{1,t-1} + \varepsilon_{1,t} ,$$

$$z_{2,t} = z_{2,t-1} + \varepsilon_{2,t} .$$

Since $\{\varepsilon_{1,t}\}$ and $\{\varepsilon_{2,t}\}$ are independent with variances $\sigma_1^2$ and $\sigma_2^2$, one shows with
$w_{1,t} = \varepsilon_{1,t}$ and $w_{2,t} = \varepsilon_{2,t}$:

$$\Gamma_w(0) = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} .$$

For $h > 0$, $\Gamma_w(h) = 0$. Thus, it holds:

$$\Omega_w = \Gamma_w(0) ,$$

where this matrix has full rank 2 in the case of positive variances and is thus
invertible. ■

Hence, the presence of cointegration of the I(1) vector $\{z_t\}$ depends on the matrix
$\Omega$. The examples show that no cointegration of $\{z_t\}$ is equivalent to the full rank of
$\Omega$, cf. Phillips (1986).
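
The rank statements of Examples 16.2 and 16.3 can be reproduced with a few lines of NumPy. This is only a sketch: the variance values are hypothetical, and the long-run variance matrix is assumed to be the usual sum of all autocovariance matrices, $\Gamma_w(0) + \sum_{h \geq 1} (\Gamma_w(h) + \Gamma_w(h)')$.

```python
import numpy as np

# Hypothetical variances sigma_1^2 and sigma_2^2 (illustration only).
s1, s2 = 1.5, 0.5

# Example 16.2: autocovariances of (w_1, w_2) under cointegration.
gamma0 = np.array([[s2 + 2.0 * s1, s2],
                   [s2, s2]])
gamma1 = np.array([[-s1, 0.0],
                   [0.0, 0.0]])
# Only lags 0 and 1 are non-zero, so the sum stops after h = 1.
omega_coint = gamma0 + gamma1 + gamma1.T

# Example 16.3: two independent random walks, only lag 0 is non-zero.
omega_indep = np.diag([s1, s2])

rank_coint = np.linalg.matrix_rank(omega_coint)
rank_indep = np.linalg.matrix_rank(omega_indep)

# The cointegrating vector (1, -1)' annihilates the long-run variance.
beta = np.array([1.0, -1.0])
quad_form = beta @ omega_coint @ beta

print(rank_coint, rank_indep, quad_form)
```

Under cointegration the quadratic form of $\Omega_w$ with the cointegrating vector $(1, -1)'$ is zero, which is another way of seeing the reduced rank.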




Linearly  Independent  Cointegration  Vectors 

For more than two I(1) variables linearly independent cointegration vectors can
exist. In such a situation there can be no talk of “the true” cointegration vector. Each
and every linear combination of independent cointegration vectors is itself again a
cointegration vector. Although this cannot occur under our assumption of a bivariate
vector, we want to become aware of this problem by means of a three-dimensional
example.

Example 16.4 (Three Interest Rates) Assume $z_1$, $z_2$ and $z_3$ to be interest rates for
one-month, two-month and three-month loans, integrated of order one. Then one
expects (due to the expectations hypothesis of the term structure) the interest rate
differentials $z_1 - z_2$ and $z_2 - z_3$ to provide stable relations:

$$z_{1,t} - z_{2,t} = v_{1,t} \sim I(0) ,$$

$$z_{2,t} - z_{3,t} = v_{2,t} \sim I(0) .$$

Here, $b_1' = (1, -1, 0)$ and $b_2' = (0, 1, -1)$ are linearly independent, and both are
cointegrating vectors for $z_t' = (z_{1,t}, z_{2,t}, z_{3,t})$ as $v_{1,t}$ and $v_{2,t}$ are both assumed to be
I(0):

$$b_1' z_t = v_{1,t} , \qquad b_2' z_t = v_{2,t} .$$

Hence, the uniqueness of the cointegration vector is lost. Rather, $b_1$ and $b_2$
form a basis for the cointegration space. Each vector contained in the plane they
span, i.e. each linear combination of $b_1$ and $b_2$, is itself again a cointegration vector.
For example for

$$\beta_\alpha = b_1 + (1 - \alpha)\, b_2 = \begin{pmatrix} 1 \\ -\alpha \\ -(1 - \alpha) \end{pmatrix}$$

one obtains a stationary relation comprising all three variables from $z_t$,

$$z_{1,t} = \alpha\, z_{2,t} + (1 - \alpha)\, z_{3,t} + v_{1,t} + (1 - \alpha)\, v_{2,t} ,$$

where $v_{1,t} + (1 - \alpha) v_{2,t}$ is I(0). In particular for $\alpha = 0$ one sees that $z_{1,t}$ and $z_{3,t}$
alone are also cointegrated with the cointegration vector $\beta_0' = b_1' + b_2' = (1, 0, -1)$.
The cointegration vectors $b_1$, $b_2$ as well as $\beta_0$ provide economically reasonable and
theoretically well-founded statements on the interest rate differentials. However, $\beta_\alpha$ for
an arbitrary $\alpha$ is just as good a cointegrating vector. Hence,

$$z_1 = \alpha\, z_2 + (1 - \alpha)\, z_3$$

for $\alpha \neq 1$ and $\alpha \neq 0$ is as well a “true” long-run equilibrium relation, even if, in
contrast to the interest rate differentials, it is not readily amenable to an economic
interpretation. These problems with the interpretation of more than one linearly
independent cointegration vector are of a fundamental nature and cannot be solved
unless one makes a priori (economically plausible) assumptions on the form of the
cointegrating vectors. Cointegration analysis is no life belt that saves us from
the lack of identification: With purely statistical methods one generally cannot make
economic statements. ■
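
The algebra of Example 16.4 can be illustrated with simulated data. The sketch below assumes NumPy; the data generating process (a single random walk driving all three rates, with iid deviations) and all names are hypothetical and only serve to show that every $\beta_\alpha$ reproduces the stationary combination $v_{1,t} + (1 - \alpha) v_{2,t}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical DGP: one common I(1) trend plus stationary spreads.
z3 = np.cumsum(rng.standard_normal(n))
v1 = rng.standard_normal(n)
v2 = rng.standard_normal(n)
z2 = z3 + v2            # z2 - z3 = v2
z1 = z2 + v1            # z1 - z2 = v1
Z = np.column_stack([z1, z2, z3])

b1 = np.array([1.0, -1.0, 0.0])
b2 = np.array([0.0, 1.0, -1.0])


def beta(alpha):
    """Cointegration vector b1 + (1 - alpha) * b2 from Example 16.4."""
    return b1 + (1.0 - alpha) * b2


alpha = 0.3
combo = Z @ beta(alpha)              # beta_alpha' z_t for all t
target = v1 + (1.0 - alpha) * v2     # stationary by construction

print(beta(alpha), np.max(np.abs(combo - target)))
```

Any value of `alpha` gives a stationary combination, which is exactly the identification problem discussed in the example.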

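For only two I(1) variables $\{y_t\}$ and $\{x_t\}$, linearly independent cointegration vectors
cannot exist. We show this by contradiction. Hence, let us assume that two linearly
independent relations exist, with $b_1 \neq b_2$ from $\mathbb{R}$. We collect them row-wise in
the matrix $B$:

$$B = \begin{pmatrix} 1 & -b_1 \\ 1 & -b_2 \end{pmatrix} .$$

Then, it holds by assumption that

$$B \begin{pmatrix} y_t \\ x_t \end{pmatrix} = \begin{pmatrix} y_t - b_1 x_t \\ y_t - b_2 x_t \end{pmatrix} = \begin{pmatrix} v_{1,t} \\ v_{2,t} \end{pmatrix}$$

is a vector of I(0) variables $\{v_{1,t}\}$ and $\{v_{2,t}\}$. Due to the independence of the
cointegration vectors, $B$ is invertible:

$$\begin{pmatrix} y_t \\ x_t \end{pmatrix} = B^{-1} \begin{pmatrix} v_{1,t} \\ v_{2,t} \end{pmatrix} .$$

This yields $\{y_t\}$ and $\{x_t\}$ as linear combinations of $\{v_{i,t}\}$, $i = 1, 2$, from which
it follows that $\{y_t\}$ and $\{x_t\}$ themselves have to be I(0), which contradicts the
assumption.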

16.3  Cointegration  Regressions 

In the case of bivariate cointegration, the LS estimator of a static regression of $y_t$
on $x_t$ tends to the true value with the sample size (i.e. with rate $n$). For this fast rate
of convergence the term superconsistency has been coined in the literature. At the
same time, limiting normality only arises under additional assumptions.

Superconsistent Estimation

We consider the LS estimator regressing $y_t$ on $x_t$ under the assumption (16.4) that
cointegration is present. For the sake of simplicity, we again do not allow for an
intercept (which is why we assume that the I(1) processes have the starting value
zero):

$$y_t = \hat b\, x_t + \hat v_t , \quad t = 1, \dots, n . \qquad (16.5)$$

Then we write for the LS estimator:

$$\hat b = \frac{\sum_{t=1}^{n} x_t\, y_t}{\sum_{t=1}^{n} x_t^2} = b + \frac{\sum_{t=1}^{n} x_t\, v_t}{\sum_{t=1}^{n} x_t^2}$$

or

$$n(\hat b - b) = \frac{n^{-1} \sum_{t=1}^{n} x_t\, v_t}{n^{-2} \sum_{t=1}^{n} x_t^2} .$$

To be able to apply the functional limit theory from Chap. 14, we now define

$$w_t = \begin{pmatrix} v_t \\ \Delta x_t \end{pmatrix} \qquad (16.6)$$

instead of $z_t' = (y_t, x_t)$ as in Sect. 15.4 without cointegration. Then it holds with the
results from Proposition 14.4(b) and (c):

$$n(\hat b - b) \stackrel{d}{\to} \frac{\int_0^1 B_2(s)\, dB_1(s) + \sum_{h=0}^{\infty} E(\Delta x_t\, v_{t+h})}{\int_0^1 B_2^2(s)\, ds} .$$

As the LS estimator tends to the true value with the sample size $n$ instead of with
only $n^{0.5}$ as for the stationary regression model, it has become common usage since
Stock (1987) and Engle and Granger (1987) to speak of superconsistency of the
static cointegration estimator from (16.5); however, this result has been known from
Phillips and Durlauf (1986) already.
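
The rate $n$ can be made visible in a small Monte Carlo experiment. The following sketch assumes NumPy; the data generating process (random walk regressor, iid equilibrium error, $b = 2$), the seed and the function name `ls_errors` are hypothetical. Under superconsistency, multiplying the sample size by 16 should shrink the typical estimation error by a factor of roughly 16, not 4.

```python
import numpy as np


def ls_errors(n, reps, b, rng):
    """Estimation errors b_hat - b of the static LS regression without intercept."""
    x = np.cumsum(rng.standard_normal((reps, n)), axis=1)   # I(1) regressor
    v = rng.standard_normal((reps, n))                      # I(0) deviation
    y = b * x + v
    b_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)
    return b_hat - b


rng = np.random.default_rng(42)
b = 2.0
err_small = ls_errors(100, 500, b, rng)
err_large = ls_errors(1600, 500, b, rng)

# Ratio of median absolute errors; rate n suggests about 1600 / 100 = 16,
# whereas the usual root-n rate would only give about 4.
ratio = np.median(np.abs(err_small)) / np.median(np.abs(err_large))
print("ratio of median absolute errors:", ratio)
```

The median is used instead of the mean simply because it is less sensitive to the occasional large error in a limited number of replications.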

Note that the estimation of $b$ is consistent despite possible correlation between the
error term $v_t$ and the regressor $x_t$ (or $\Delta x_t$). In this respect the cointegration regression knocks
out the simultaneity bias (or “Haavelmo bias”): Superconsistency is a strong
asymptotic argument for single equation regressions despite possibly existing
dependencies through simultaneous relations between the individual equations,
i.e. despite correlation between regressors and error term. Accordingly, the
cointegration approach can be understood as a reaction to the simultaneous equation
methodology of former decades. At the same time, it is no longer clear which
variable constitutes the endogenous left-hand side and which quantity is the
exogenous regressor. Besides (16.4), it also holds as a “true relation” that

$$x_t = \frac{y_t}{b} - \frac{v_t}{b} .$$

Hence, if $x_t$ is regressed on $y_t$, then one obtains analogously a superconsistent
estimator for $b^{-1}$. This contrasts sharply with the results of standard econometrics
for stationary data, where the (asymptotic) validity of LS crucially depends on the correct
specification of the single equation and on exogeneity assumptions.


Further  Asymptotic  Properties 

In Problems 16.2 and 16.3 we prove the further properties of the following
proposition. Here, the standard errors of the t-statistic, the uncentered coefficient
of determination and the Durbin-Watson statistic are defined as in Sect. 15.4.


Proposition 16.2 (Cointegration Regression) For cointegrated I(1) processes $\{x_t\}$
and $\{y_t\}$ it holds with (16.4) and the notation introduced that

(a) $n(\hat b - b) \stackrel{d}{\to} \dfrac{\int_0^1 B_2(s)\, dB_1(s) + \Delta_{xv}}{\int_0^1 B_2^2(s)\, ds}$ ,

(b) $n(1 - R_{uc}^2) \stackrel{d}{\to} \dfrac{\gamma_1(0)}{b^2 \int_0^1 B_2^2(s)\, ds}$ ,

(c) $s^2 \stackrel{p}{\to} \gamma_1(0)$ ,

(d) $t_b = \dfrac{\hat b - b}{s_b} \stackrel{d}{\to} \dfrac{\int_0^1 B_2(s)\, dB_1(s) + \Delta_{xv}}{\sqrt{\gamma_1(0) \int_0^1 B_2^2(s)\, ds}}$ ,

(e) $dw \stackrel{p}{\to} 2\, (1 - \rho_1(1))$ ,

where

$$\Delta_{xv} := \sum_{h=0}^{\infty} E(\Delta x_t\, v_{t+h}) \quad \text{and} \quad \gamma_1(0) := Var(v_t) ,$$

as $n \to \infty$.

Three remarks are to help with the interpretation. (i) Note again that superconsistency
holds even if the regressors $x_t$ correlate with the error terms $v_t$. This
nice property, however, comes at a price: Unfortunately, the limiting distributions
from Proposition 16.2(a) and (d) are (without further assumptions) generally not
Gaussian. (ii) As a rule, for regressions involving trending (integrated) variables
one empirically observes values of the coefficient of determination near one, which
is explained by Proposition 16.2(b). For this reason, the coefficient of determination
cannot be interpreted as usual for trending (integrated) time series: Since $\{y_t\}$
does not have a constant variance, the coefficient of determination does not give
the percentage of the variance explained by the regression. (iii) At no point was it
assumed that the error terms, $\{v_t\}$, are white noise. For the first order residual
autocorrelation, it holds

$$\frac{\sum_{t=1}^{n-1} \hat v_t\, \hat v_{t+1}}{\sum_{t=1}^{n} \hat v_t^2} \stackrel{p}{\to} \rho_1(1) = \frac{E(v_t\, v_{t+1})}{Var(v_t)} ,$$

which is just reflected by the behavior of the Durbin-Watson statistic.


Asymptotic  Normality 

The price for the superconsistency without exogeneity assumption is that the
limiting distribution of the t-statistic is generally not Gaussian anymore. However,
if $v_t$ and $\Delta x_s$ are stochastically independent for all $t$ and $s$, then Krämer (1986)
shows that asymptotic normality arises. This assumption, however, is stronger than
necessary. It suffices to require, first, $\Delta_{xv} = 0$ and, second, that the Brownian
motions $B_1$ and $B_2$ are independent. For independence of $B_1$ and $B_2$ we only need

$$\omega_{12} = \sum_{h=-\infty}^{\infty} E(\Delta x_t\, v_{t+h}) = 0 \qquad (16.7)$$

as under this condition it holds that $B_i = \omega_i W_i$. Due to Proposition 10.4 the
following corollary is obtained (also cf. Problem 16.4).

Corollary 16.1 (Asymptotic Normality) If $\Delta_{xv}$ from Proposition 16.2 is zero, then
it holds under (16.7) that

$$t_b \stackrel{d}{\to} \frac{\omega_1}{\sqrt{\gamma_1(0)}}\, \mathcal{N}(0, 1)$$

as $n \to \infty$.

If one has consistent estimators for the variance and the long-run variance, then
the t-statistic can be modified as follows and can be applied with standard normal
distribution asymptotics under Corollary 16.1:

$$\tau_b := \frac{\sqrt{\hat\gamma_1(0)}}{\hat\omega_1}\, t_b \stackrel{d}{\to} \mathcal{N}(0, 1) .$$

For the estimation of the long-run variance, we refer to the remarks in Sect. 15.2, in
particular Example 15.2. As $\{v_t\}$ itself is not observable, $\gamma_1(0)$ and $\omega_1^2$ have to be
calculated from $\hat v_t = y_t - \hat b x_t$.


Efficient  Estimation 

The assumptions $\Delta_{xv} = 0$ and $\omega_{12} = 0$ required for Corollary 16.1 are often not
met in practice. We now consider modifications of LS resulting in limiting normality
that get around such restrictions. To that end we have a closer look at the LS limit
from Proposition 16.2, now called $\mathcal{L}(\hat b)$:

$$n(\hat b - b) \stackrel{d}{\to} \mathcal{L}(\hat b) .$$

We define the process $B_{1.2}$,

$$B_{1.2}(s) := B_1(s) - \omega_{12}\, \omega_2^{-2}\, B_2(s) , \qquad (16.8)$$

in such a way that it does not correlate with $B_2$:

$$E\big(B_{1.2}(r)\, B_2(s)\big) = \min(r, s)\, \omega_{12} - \min(r, s)\, \omega_{12}\, \omega_2^{-2}\, \omega_2^2 = 0 .$$

Because of normality the two processes are thus independent. The variance of the
new process is

$$Var(B_{1.2}(s)) = E\big(B_{1.2}^2(s)\big) = s\, \omega_{1.2}^2 , \quad \omega_{1.2}^2 := \omega_1^2 - \omega_{12}^2\, \omega_2^{-2} .$$

With

$$B_1(s) = B_{1.2}(s) + B_2(s)\, \omega_2^{-2}\, \omega_{12}$$

we may now decompose the LS limit as follows:

$$\mathcal{L}(\hat b) = \frac{\int_0^1 B_2(s)\, dB_{1.2}(s)}{\int_0^1 B_2^2(s)\, ds} + \frac{\int_0^1 B_2(s)\, dB_2(s)\, \omega_2^{-2}\, \omega_{12}}{\int_0^1 B_2^2(s)\, ds} + \frac{\Delta_{xv}}{\int_0^1 B_2^2(s)\, ds} = \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3 .$$

The first component, $\mathcal{L}_1$, is conditionally normal with mean zero, i.e.

$$\mathcal{L}_1 \mid B_2 \sim \mathcal{N}\left(0, \frac{\omega_{1.2}^2}{\int_0^1 B_2^2(s)\, ds}\right) ,$$

which is true by Proposition 10.4. The second component, defined as a multiple of
$\int_0^1 B_2(s)\, dB_2(s) = (B_2^2(1) - \omega_2^2)/2$, is stochastic and introduces skewness into $\mathcal{L}(\hat b)$,
while finally this distribution is shifted by the term involving $\Delta_{xv}$. Consequently, for
arbitrary $\epsilon > 0$ one has

$$P(|\mathcal{L}_1| < \epsilon) \geq P(|\mathcal{L}(\hat b)| < \epsilon) ,$$

see Saikkonen (1991, Theorem 3.1). In that sense, $\mathcal{L}_1$ is more concentrated
around zero than $\mathcal{L}(\hat b)$ in general, such that, intuitively, the LS limit is
closest to zero for $\omega_{12} = \Delta_{xv} = 0$. This is the intuition for the following definition:
Let $b^+$ denote a cointegration estimator with

$$n(b^+ - b) \stackrel{d}{\to} \mathcal{L}_1 \quad \text{where} \quad \mathcal{L}_1 \mid B_2 \sim \mathcal{N}\left(0, \frac{\omega^2}{\int_0^1 B_2^2(s)\, ds}\right) \qquad (16.9)$$

for some positive constant $\omega$; then $b^+$ is said to be efficient (see Saikkonen 1991 for
a more general discussion). To further justify this notion of efficiency we note that
full information maximum likelihood estimation of a cointegrated system results in
exactly this distribution, see Phillips (1991).

Efficient cointegration regressions are not only interesting because they achieve
the lower bound for the standard error; more importantly, related t-type statistics
are asymptotically normal, which allows for standard inference. Suppose we have
an efficient estimator $b^+$ satisfying (16.9), and that $\hat\omega$ (typically computed from
cointegration residuals) is consistent for $\omega$. Then we define the t-type statistic

$$t^+ := \frac{b^+ - b}{\hat\omega}\, \sqrt{\sum_{t=1}^{n} x_t^2} .$$

From the previous discussion it follows, see Phillips and Park (1988):

$$t^+ \stackrel{d}{\to} \mathcal{N}(0, 1) .$$

Consequently, the caveat of superconsistent LS cointegration regression, lacking
normality in general, is overcome by efficient estimators.

Let us repeat once more: LS cointegration estimation is efficient under the not
very realistic assumption that $\omega_{12} = \Delta_{xv} = 0$. Several modifications of LS
achieving efficiency without this assumption have been proposed. First, the so-called
dynamic LS estimator suggested independently by Saikkonen (1991) and
Stock and Watson (1993) is settled in the time domain, see also Phillips and Loretan
(1991); second, so-called frequency domain based modifications of LS have been
suggested by Phillips and Hansen (1990) (“fully modified LS”) or Park (1992)
(“canonical cointegrating regression”); they all meet (16.9) and $t^+ \sim \mathcal{N}(0, 1)$,
asymptotically.
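
To give an idea of how a time domain modification works, the following is a rough sketch in the spirit of dynamic LS, assuming NumPy. It is not the estimator exactly as defined by Saikkonen (1991) or Stock and Watson (1993): the number of leads and lags (`k = 1`), the data generating process with $v_t = \rho\, \Delta x_t + u_t$, and the function names `static_ls` and `dynamic_ls` are assumptions made for illustration. The static regression is augmented by leads and lags of $\Delta x_t$, which absorbs the correlation between $v_t$ and $\Delta x_t$.

```python
import numpy as np


def simulate(n, b, rho, rng):
    """Cointegrated pair with v_t = rho * dx_t + u_t (endogenous regressor)."""
    e = rng.standard_normal(n)      # increments of x
    u = rng.standard_normal(n)
    x = np.cumsum(e)
    v = rho * e + u
    return b * x + v, x


def static_ls(y, x):
    """Static LS regression of y on x without intercept."""
    return (x @ y) / (x @ x)


def dynamic_ls(y, x, k=1):
    """LS of y_t on x_t and dx_{t+j}, j = -k, ..., k (no intercept)."""
    n = len(y)
    dx = np.diff(x, prepend=0.0)    # dx_t with starting value x_0 = 0
    cols = [x[k:n - k]]
    cols += [dx[k + j:n - k + j] for j in range(-k, k + 1)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y[k:n - k], rcond=None)
    return coef[0]                  # coefficient on x_t


rng = np.random.default_rng(5)
b, rho, n, reps = 1.0, 0.8, 200, 500
err_ls = np.empty(reps)
err_dls = np.empty(reps)
for r in range(reps):
    y, x = simulate(n, b, rho, rng)
    err_ls[r] = static_ls(y, x) - b
    err_dls[r] = dynamic_ls(y, x) - b

bias_ls, bias_dls = err_ls.mean(), err_dls.mean()
print("mean error static LS :", bias_ls)
print("mean error dynamic LS:", bias_dls)
```

In this design the static LS estimator is shifted away from $b$ in finite samples because $\Delta_{xv} \neq 0$, while the augmented regression removes most of that shift.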


Linear  Time  Trends 

In Sect. 15.2 we considered linear time trends and I(1) processes, i.e. so-called
integrated processes with drift:

$$x_t = \mu + x_{t-1} + e_t , \quad \mu \neq 0 . \qquad (16.10)$$

By repeated substitution one obtains

$$x_t = x_0 + \mu\, t + \sum_{j=1}^{t} e_j ,$$

i.e. $\{x_t\}$ consists of a linear trend with slope $\mu$, an I(1) component, and a
starting value whose influence can be neglected, such that we set $x_0 = 0$ w.l.o.g.
In addition, let the cointegration relation (16.4) hold true. Consequently, $\{y_t\}$
exhibits a linear trend as well, with slope $b \mu$. The cointegration relation (16.4)
simultaneously eliminates the deterministic linear time trend and the stochastic I(1)
trend from both series. In this case the static LS regression from (16.5) yields an
even faster rate of convergence ($n^{1.5}$ instead of $n$) and, at the same time, the limiting
distribution is Gaussian. The following proposition is a special case of the more
general results from West (1988). We prove it in Problem 16.5.

Proposition 16.3 (West) We assume cointegrated I(1) processes $\{x_t\}$ and $\{y_t\}$ with
drift (i.e. (16.10) with (16.4)). Then it holds for the regression (16.5) without
intercept that

$$n^{1.5} (\hat b - b) \stackrel{d}{\to} \mathcal{N}\left(0, \frac{3\, \omega_1^2}{\mu^2}\right)$$

as $n \to \infty$, where $\omega_1^2$ is the long-run variance of $\{v_t\}$ from (16.4).

For the sake of completeness, note that the asymptotics from Proposition 16.3 is
qualitatively retained if the regression is run with intercept. However, the variance of
the Gaussian distribution is changed. It becomes $12\, \omega_1^2 / \mu^2$. For practical inference,
$\mu$ has to be estimated from $\{x_t\}$ or $\{\Delta x_t\}$, while $\omega_1^2$ can be estimated consistently
from the cointegration residuals $\hat v_t$. For $\mu = 1$ the limiting distribution from
Proposition 16.3 equals that from (15.5), which is not coincidental, see also the
proof in Problem 16.5: The scalar I(1) regressor with drift is dominated by the
linear time trend; hence, the cointegration regression amounts to a trend stationary
regression.

Note, however, that Proposition 16.3 holds only in our special case that $x_t$ is a
scalar I(1) variable. If one has a vector of I(1) regressors with drift instead of a
scalar variable, then the asymptotic normality in general does not hold anymore,
and the very fast convergence with rate $n^{1.5}$ is lost as well. Instead, Hansen (1992)
proved results in line with Proposition 16.2 if there are several I(1) regressors of
which at least one has a drift.
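
The variance $3\, \omega_1^2 / \mu^2$ from Proposition 16.3 can be compared with a simulation. The sketch below assumes NumPy; the choices $\mu = 2$, $b = 1$, iid standard normal $v_t$ (so that $\omega_1^2 = 1$), the seed and the sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 1000, 1000
b, mu = 1.0, 2.0

# x_t = mu * t + sum of e_j with x_0 = 0, as obtained from (16.10).
e = rng.standard_normal((reps, n))
x = np.cumsum(mu + e, axis=1)

# iid deviations from equilibrium, hence long-run variance omega_1^2 = 1.
v = rng.standard_normal((reps, n))
y = b * x + v

# Static LS regression without intercept, replication by replication.
b_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)
scaled = n ** 1.5 * (b_hat - b)

var_mc = scaled.var()
var_theory = 3.0 * 1.0 / mu ** 2
print("Monte Carlo variance:", var_mc, " theoretical value:", var_theory)
```

With these settings the simulated variance of $n^{1.5}(\hat b - b)$ should lie close to $3/4$; repeating the experiment with an intercept in the regression would be a natural next check of the value $12\, \omega_1^2 / \mu^2$.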


16.4 Cointegration Testing

If one regresses nonstationary (integrated) time series on each other, the
interpretation of the regression outcome largely depends on whether the series are
cointegrated or not. Hence, one has to test for the absence or presence of
cointegration.


Residual-Based  Dickey-Fuller  Test 

The idea of the following test for the null hypothesis of no cointegration dates
back to Engle and Granger (1987), although a rigorous asymptotic treatment was
provided later by Phillips and Ouliaris (1990). The idea is very simple. Without
cointegration any linear combination of I(1) variables results in a series that has
a unit root, too. Hence, the Dickey-Fuller test is applied to LS residuals, which are
computed from a regression with intercept:

$$\hat u_t = y_t - \hat\alpha - \hat\beta\, x_t .$$

Due to the included intercept, $\{\hat u_t\}$ are zero mean by construction. Hence, the DF
regression in the second step may be run w.l.o.g. without intercept:

$$\hat u_t = \hat a\, \hat u_{t-1} + \hat e_t , \quad t = 1, \dots, n .$$

The LS estimator $\hat a$ converges to 1 under the null hypothesis of a residual unit root
with the rate known from Sect. 15.3. However, $\hat\beta$ does not converge to 0, but to a
limit characterized in Proposition 15.5. Consequently, the asymptotic distribution of
$n(\hat a - 1)$ does not only depend on one WP but rather on two. Let $t_a(2)$, involving 2
I(1) processes, denote the t-statistic related to $\hat a - 1$. Then the limit depends on two
standard Wiener processes $W_1$ and $W_2$:

$$t_a(2) \stackrel{d}{\to} \mathcal{DF}(W_1, W_2) . \qquad (16.11)$$

Interestingly, this limit is free of nuisance parameters as long as $\{\Delta y_t\}$ and $\{\Delta x_t\}$
are white noise; in particular, it does not depend on the possible correlation
between $\{\Delta y_t\}$ and $\{\Delta x_t\}$; see also Problem 16.7. If $\{\Delta y_t\}$ and $\{\Delta x_t\}$ are not white
noise processes, then $\hat a$ may be computed from a lag-augmented regression, or a
modification to the test statistic in line with Phillips (1987) has to be applied. Critical
values or p-values are most often taken from MacKinnon (1991, 1996). Here, we do
not present a definition of the functional shape of the limit $\mathcal{DF}(W_1, W_2)$; rather, to
give at least an idea thereof, we now consider explicitly the less complicated case
without constant. So, $\hat u_t$ and $\hat\beta$ are now from a regression without intercept,

$$\hat u_t = y_t - \hat\beta\, x_t .$$

We use a new notation to denote the subsequent DF regression

$$\hat u_t = \hat a\, \hat u_{t-1} + \hat\varepsilon_t , \quad t = 1, \dots, n .$$


In Problem 16.7 the following result is given.


Proposition 16.4 (Phillips & Ouliaris) Let $\{z_t\}$ with $z_t' = (y_t, x_t)$ be a random
walk such that $\{\Delta y_t\}$ and $\{\Delta x_s\}$ are white noise processes not correlated for $t \neq s$,
although we do allow for contemporaneous correlation. In case of no cointegration
it holds with the notation introduced above that

$$n(\hat a - 1) \stackrel{d}{\to} \frac{\int_0^1 U(t)\, dU(t)}{\int_0^1 U^2(t)\, dt} , \quad n \to \infty ,$$

with

$$U(t) := W_1(t) - \frac{\int_0^1 W_1(s)\, W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds}\, W_2(t) ,$$

where $W_1$ and $W_2$ are two independent Wiener processes.


The functional shape of this limit corresponds exactly to the one in (1.11), except that
the WP $W$ is replaced by $U$, which is, however, no longer a WP. Not surprisingly,
the new process $U$ is defined as residual from a projection of $W_1$ (corresponding to
$y$) on $W_2$ (corresponding to $x$). Once more this shows the power and elegance of the
functional limit theory approach introduced in Chap. 14.
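
The two steps of the residual-based Dickey-Fuller test can be sketched as follows, assuming NumPy. The function name `residual_df`, the simulated series and the seed are hypothetical; no lag augmentation is included and no critical values are embedded, so the statistic would still have to be compared with tabulated values such as those of MacKinnon.

```python
import numpy as np


def residual_df(y, x):
    """Two-step residual-based DF statistic (first step with intercept)."""
    n = len(y)
    # Step 1: static regression of y on a constant and x.
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ coef
    # Step 2: DF regression of u_t on u_{t-1} without intercept.
    u_lag, u_now = u[:-1], u[1:]
    a_hat = (u_lag @ u_now) / (u_lag @ u_lag)
    resid = u_now - a_hat * u_lag
    s2 = (resid @ resid) / (len(u_now) - 1)
    se = np.sqrt(s2 / (u_lag @ u_lag))
    return a_hat, (a_hat - 1.0) / se


rng = np.random.default_rng(3)
n = 500
x = np.cumsum(rng.standard_normal(n))
y_coint = 1.0 + 2.0 * x + rng.standard_normal(n)   # cointegrated with x
y_indep = np.cumsum(rng.standard_normal(n))        # independent random walk

a_c, t_c = residual_df(y_coint, x)
a_i, t_i = residual_df(y_indep, x)
print("cointegrated:    a_hat = %.3f, t = %.2f" % (a_c, t_c))
print("not cointegrated: a_hat = %.3f, t = %.2f" % (a_i, t_i))
```

For the cointegrated pair the residuals are close to white noise, so $\hat a$ is far below one and the t-statistic is strongly negative; for independent random walks $\hat a$ stays near one.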

Residual-Based KPSS Test

It is natural to apply the KPSS test to regression residuals as well. As in
Sect. 15.3 the hypotheses are now exchanged: We test the null hypothesis
of cointegration against the alternative of no cointegration. Working with LS
cointegration residuals, $\hat v_t = y_t - \hat b x_t$, we have under the null hypothesis

$$\hat v_t = v_t - (\hat b - b)\, x_t ,$$

where

$$n(\hat b - b) \stackrel{d}{\to} b_\infty$$

with the limiting distribution $b_\infty$ characterized in Proposition 16.2. Interestingly, we
observe by a FCLT (Proposition 14.1) that

$$n^{0.5} (\hat b - b)\, x_{\lfloor rn \rfloor} \Rightarrow b_\infty\, B_2(r) ,$$

with $B_2$ being the Brownian motion behind $\{x_t\}$ from $z_t' = \big(\sum_{j=1}^{t} v_j ,\, x_t\big)$. Therefore,
it holds that $(\hat b - b)\, x_t$ converges to zero with growing sample size, and the empirical
residuals are proxies of the unobserved cointegration deviation: $\hat v_t \approx v_t$. Hence, it
is tempting to believe that a KPSS test applied to the sequence $\{\hat v_t\}$ behaves as if
applied to $\{v_t\}$. This, however, is not correct, as we will demonstrate next, since
the limit characterized in Proposition 15.4 is not recovered when working with
cointegration residuals.

The KPSS test builds on the partial sum process $\hat S_t = \sum_{j=1}^{t} \hat v_j$. Mimicking the
proof of Proposition 14.2(a) in Problem 14.3, we obtain the following FCLT for the
partial sum process:

$$n^{-0.5}\, \hat S_{\lfloor rn \rfloor} = n^{-0.5} \sum_{j=1}^{\lfloor rn \rfloor} v_j - n(\hat b - b)\, n^{-1.5} \sum_{j=1}^{\lfloor rn \rfloor} x_j \Rightarrow B_1(r) - b_\infty \int_0^r B_2(s)\, ds .$$

Notwithstanding that $\hat v_t \approx v_t$ we must thus not jump to the conclusion that the
residual effect is negligible: The more careful analysis showed that the limit of the
partial sum process depends on the distribution $b_\infty$ arising from the cointegration
regression.

What is more, we know that the LS limit $b_\infty$ from Proposition 16.2 is plagued by
the nuisance parameters $\Delta_{xv}$ and $\omega_{12}$, except for the special case of Corollary 16.1.
Therefore, Shin (1994) suggested applying the KPSS test not with LS residuals but
with residuals from an efficient regression (now with intercept),

$$S_t^+ = \sum_{j=1}^{t} v_j^+ , \quad v_t^+ = y_t - a^+ - b^+ x_t ,$$

where $b^+$ is efficient in the sense of (16.9). Efficient cointegration regressions
rely on removing $\Delta_{xv}$ and $\omega_{12}$ consistently. Hence, Shin (1994) showed that the
limiting distribution of the KPSS test applied to efficient residuals is free of nuisance
parameters, and he provided critical values. Let $\eta^+(2)$ denote the residual-based
KPSS statistic building on $v_t^+ = y_t - a^+ - b^+ x_t$, thus involving two I(1) variables.
Under the null hypothesis of cointegration it holds asymptotically that (Shin, 1994,
Thm. 2)

$$\eta^+(2) \stackrel{d}{\to} \mathcal{CM}(W_1, W_2) ,$$

where

$$\mathcal{CM}(W_1, W_2) = \int_0^1 \left[ W_1(s) - s\, W_1(1) - \frac{\int_0^s \bar W_2(r)\, dr \int_0^1 \bar W_2(r)\, dW_1(r)}{\int_0^1 \bar W_2^2(r)\, dr} \right]^2 ds , \qquad (16.12)$$

and $\bar W$ is again short for a demeaned Wiener process. For related work see Harris and
Inder (1994) or Leybourne and McCabe (1994), although the latter paper considered
only the case of LS residuals.


Error-Correction  Test 

The third test we look into is not residual-based. The analysis rather relies on the
error-correction equation (16.2):

$$\Delta y_t = \gamma\, v_{t-1} + \text{differences} + \varepsilon_t .$$

The fact that we restrict the analysis to the error-correction equation of $\{y_t\}$ and that
we ignore Eq. (16.3) has to be justified by the assumption

$$\gamma_x = 0 . \qquad (16.13)$$

This assumption implies that, when cointegration is present, only $\Delta y_t$ reacts
to the deviation from the equilibrium of the previous period. Because of
assumption (16.13), absence of cointegration means $\gamma = 0$, which is the null
hypothesis. For there to be an adjustment towards the equilibrium in the case of
cointegration, $\gamma < 0$ under the alternative:

$$H_0 : \gamma = 0 \quad \text{vs.} \quad H_1 : \gamma < 0 .$$

To provide further intuition for the test statistic, we rewrite the error-correction
equation (16.2) by inserting the definition of $v_{t-1}$:

$$\Delta y_t = \gamma\, (y_{t-1} - b\, x_{t-1}) + \sum_{j=1}^{p} a_j\, \Delta y_{t-j} + \sum_{j=1}^{\ell} \alpha_j\, \Delta x_{t-j} + \varepsilon_t$$

$$\phantom{\Delta y_t} = \gamma\, y_{t-1} + \theta\, x_{t-1} + \sum_{j=1}^{p} a_j\, \Delta y_{t-j} + \sum_{j=1}^{\ell} \alpha_j\, \Delta x_{t-j} + \varepsilon_t . \qquad (16.14)$$

Here, we defined

$$\theta = -\gamma\, b ,$$

where the null hypothesis of course implies $\theta = 0$. Hence, one may test the
null hypothesis by means of an F-type test statistic for $\gamma = \theta = 0$, which has
been investigated by Boswijk (1994). Alternatively, one may employ a t-type test
specifically for $\gamma = 0$ only as proposed by Banerjee, Dolado, and Mestre (1998).
The following proposition characterizes the asymptotic behavior of the LS estimator
$\hat\gamma$, cf. Banerjee, Dolado, and Mestre (1998, Proposition 1). The Wiener processes
$W_1$ and $W_2$ are adopted from Proposition 14.4 with $z_t' = (y_t, x_t)$. In order to get
a limiting distribution free of nuisance parameters, we assume that $\Delta x_t$ and $\varepsilon_s$ are
uncorrelated at arbitrary points in time:

$$E(\Delta x_t\, \varepsilon_s) = 0 ; \qquad (16.15)$$

see also the proof in Problem 16.6.

Proposition  16.5  (BDM)  Let  the  1(1)  processes  {xt}  and  {yt}  be  not  cointegrated, 
and  let  the  exogeneity  assumption  (16.15)  be  fulfilled.  Then,  it  holds  for  the  LS 
estimator  from  the  regression  (16.14)  that 

11  1  1 

/  Wj  (s)  ds  f  Wi  (s)  dWi  (s)  -fWi  (s)  W2  (s)  ds  f  W2  (s)  dW]  (s) 

~  d  0  0  0  0 

ny  ->  - -= - 

1  1  /!  \2 

/  W?(s)ds  f  Wf(s)ds  —  I  /  W\  (s)  W 2  (s)ds  I 

0  0  \o  / 


as  n 


00. 
Obviously, this limiting distribution can be reshaped into the following form (which lends itself to a multivariate generalization with vectors $\{x_t\}$):

$$n\,\hat\gamma \xrightarrow{d} \frac{\int_0^1 W_1(s)\,dW_1(s) - \int_0^1 W_1(s)W_2(s)\,ds \left(\int_0^1 W_2^2(s)\,ds\right)^{-1} \int_0^1 W_2(s)\,dW_1(s)}{\int_0^1 W_1^2(s)\,ds - \int_0^1 W_1(s)W_2(s)\,ds \left(\int_0^1 W_2^2(s)\,ds\right)^{-1} \int_0^1 W_1(s)W_2(s)\,ds} .$$


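In fact, Banerjee et al. (1998) suggest the computation of the t-statistic relating to $\hat\gamma$: $t_\gamma(2)$. As limiting distribution under $H_0$ one obtains:

$$t_\gamma(2) \xrightarrow{d} \mathcal{BDM}(W_1, W_2) \qquad (16.16)$$

$$\mathcal{BDM}(W_1, W_2) = \frac{\int_0^1 W_1(s)\,dW_1(s) - \int_0^1 W_1(s)W_2(s)\,ds \left(\int_0^1 W_2^2(s)\,ds\right)^{-1} \int_0^1 W_2(s)\,dW_1(s)}{\sqrt{\int_0^1 W_1^2(s)\,ds - \int_0^1 W_1(s)W_2(s)\,ds \left(\int_0^1 W_2^2(s)\,ds\right)^{-1} \int_0^1 W_1(s)W_2(s)\,ds}} .$$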


Again, in most practical situations an intercept will be included in (16.14). If the corresponding test statistic is called $\bar t_\gamma(2)$, then it holds under the null hypothesis that

$$\bar t_\gamma(2) \xrightarrow{d} \overline{\mathcal{BDM}}(W_1, W_2) . \qquad (16.17)$$

This limit has the same functional shape as $\mathcal{BDM}(W_1, W_2)$, only that the $W_i$ have to be replaced by the demeaned analogs $\overline{W}_i$, $i = 1, 2$.

Simulated critical values for conducting the tests can be found in Banerjee et al. (1998). One rejects for small values. p-values are available from Ericsson and MacKinnon (2002), too.
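To make the mechanics of the error-correction test concrete, the following Python sketch computes the LS estimate of $\gamma$ and its t-statistic from a regression of the form (16.14) with an intercept. It is only an illustration under simplifying assumptions that are not taken from the text: lagged differences are omitted, and the function name `ecm_t_statistic`, the simulated series and the adjustment coefficient $-0.5$ are hypothetical. The value $-3.19$ is the asymptotic 5 % critical value quoted in Example 16.5 for a regressor without drift.

```python
import numpy as np

def ecm_t_statistic(y, x):
    """LS estimate and t-statistic of gamma in
    Delta y_t = c + gamma*y_{t-1} + theta*x_{t-1} + eps_t
    (hypothetical helper; lagged differences are omitted for brevity)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    dy = np.diff(y)                                   # Delta y_t, t = 2, ..., n
    X = np.column_stack([np.ones(len(dy)), y[:-1], x[:-1]])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)     # (c, gamma, theta)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # usual LS covariance
    gamma_hat = coef[1]
    return gamma_hat, gamma_hat / np.sqrt(cov[1, 1])

# Simulated illustration (all numbers are assumptions).
rng = np.random.default_rng(0)
n = 500
x = np.cumsum(rng.standard_normal(n))                 # I(1) regressor without drift
eps = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    # y adjusts to the equilibrium y = x with gamma = -0.5
    y[t] = y[t - 1] - 0.5 * (y[t - 1] - x[t - 1]) + eps[t]

gamma_hat, t_gamma = ecm_t_statistic(y, x)
reject_no_cointegration = t_gamma < -3.19             # reject for small values
```

In this cointegrated design the estimate should lie close to $-0.5$ and the t-statistic far below the critical value; for two independent random walks the statistic would typically stay above it.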


Linear  Time  Trends 

It has been mentioned in this and the previous chapter that in practice one would run regressions with an intercept to account for non-zero means of the series. But how should one proceed if the mean function follows a linear time trend, i.e. if the series are I(1) with drift? This is a quite realistic assumption for many economic and financial time series where positive growth rates are plausible. One might consider the analysis of detrended data, see Sect. 15.2. Note that the regression of detrended series is equivalent to including a linear time trend in the regression (see Frisch & Waugh, 1933):

$$y_t = \hat\alpha + \hat\delta\, t + \hat\beta\, x_t + \hat u_t .$$
This is why we call such regressions also detrended regressions. Similarly, one
may  augment  the  error-correction  regression  (16.14)  by  a  linear  time  trend  (and  a 
constant).  Many  economists,  however,  do  not  run  detrended  regression  or  detrend 
the  series  even  if  the  data  display  a  linear  time  trend  by  eyeball  inspection. 
Economically,  it  is  often  more  meaningful  to  “explain”  one  trend  by  another  instead 
of  regressing  deviations  from  linear  trends  on  each  other.  Also  statistically  the 
regression  of  detrended  data  may  not  seem  advisable  since  power  losses  are  to  be 
expected  (see  Hamilton,  1994,  Sect.  19.2). 
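The Frisch-Waugh equivalence between detrending the series first and including a linear time trend in the regression can be checked numerically. The sketch below uses simulated data; the helper `ols`, the sample size and all coefficients are assumptions made for illustration only.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 200
t = np.arange(1, n + 1, dtype=float)
x = 0.3 * t + np.cumsum(rng.standard_normal(n))       # I(1) with drift
y = 1.0 + 0.1 * t + 0.8 * x + rng.standard_normal(n)

Z = np.column_stack([np.ones(n), t])                  # intercept and linear trend

# (i) regression of y on intercept, trend and x
beta_full = ols(np.column_stack([Z, x]), y)[2]

# (ii) regression of detrended y on detrended x
x_detr = x - Z @ ols(Z, x)
y_detr = y - Z @ ols(Z, y)
beta_detr = ols(x_detr[:, None], y_detr)[0]
```

Both slope estimates coincide up to floating-point error, which is why a regression including a time trend may be called a detrended regression.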

Running regressions with intercept only, i.e. without detrending, in the presence of linear time trends in the regressors has some subtle implications, however. Generally, the presence of a linear time trend in the data not accounted for in the regression will affect the limiting distributions. Just remember that a simple (or bivariate) cointegration regression in the presence of a linear time trend (Proposition 16.3) resembles more the detrending of a trend stationary process (Proposition 15.1) than a cointegration regression without linear trend (Proposition 16.2). More precisely, a linear time trend in $\{x_t\}$ will dominate the stochastic unit root in the following sense: If $\{x_t\}$ is I(1) with drift, $\Delta x_t = \mu + e_t$, $\mu \neq 0$, or

$$x_t = x_0 + \mu\, t + \sum_{j=1}^{t} e_j , \quad t = 1, \ldots, n ,$$

then this process grows with rate $n$ (and not $n^{0.5}$, see Proposition 14.1):

$$\frac{x_{\lfloor rn \rfloor}}{n} \Rightarrow 0 + \mu\, r + 0 , \quad \mu \neq 0 .$$
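The dominance of the drift can be seen in a small simulation. The sketch below is an illustration with assumed values ($\mu = 0.5$, $x_0 = 0$, standard normal $e_t$): after division by $n$, the drift part of $x_n$ stays near $\mu$, while the stochastic part becomes negligible.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 0.5                       # assumed drift
n = 10_000
e = rng.standard_normal(n)     # assumed white noise increments
stochastic_trend = np.cumsum(e)
x = mu * np.arange(1, n + 1) + stochastic_trend   # x_0 = 0

ratio_drift = x[-1] / n                  # close to mu
ratio_stoch = stochastic_trend[-1] / n   # close to 0, of order n**(-0.5)
```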

This provides an intuition for the following finding by Hansen (1992, Theorem 7): If $\{x_t\}$ is I(1) with drift and $\{y_t\}$ and $\{x_t\}$ are not cointegrated, and if a regression-based DF test for no cointegration is computed from a regression with intercept but without detrending, then the limiting distribution of the t-type DF statistic is not given by $\overline{\mathcal{DF}}(W_1, W_2)$ from (16.11), but rather by the detrended univariate distribution $\widetilde{\mathcal{DF}}$ given in (15.10). A corresponding result was established for the error-correction test by Hassler (2000a): If $\{x_t\}$ and $\{y_t\}$ are I(1) but not cointegrated, and if $\{x_t\}$ is integrated with drift and the error-correction test for no cointegration is computed from a regression with intercept but without detrending, then the limiting distribution of the t-statistic is not given by $\overline{\mathcal{BDM}}(W_1, W_2)$ from (16.17), but by the detrended Dickey-Fuller distribution $\widetilde{\mathcal{DF}}$. And similarly: If $\{x_t\}$ is I(1) with drift and cointegrated with $\{y_t\}$, and if a regression-based KPSS test for cointegration is computed from a regression with intercept only, then the limiting distribution of the KPSS statistic is not given by $\overline{\mathcal{CM}}(W_1, W_2)$ from (16.12), but rather by the detrended univariate distribution $\widetilde{\mathcal{CM}}$ given in (15.11); see Hassler (2000b). Hence, we have the following proposition.

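Proposition 16.6 (Hansen & Hassler) Consider the test statistics $\bar t_a(2)$, $\bar t_\gamma(2)$ or $\bar\eta(2)$ computed without detrending to test for the null hypothesis of (no) cointegration of the I(1) processes $\{x_t\}$ and $\{y_t\}$. With $\widetilde{\mathcal{DF}}$ from (15.10) and $\widetilde{\mathcal{CM}}$ from (15.11) it holds under the respective null hypotheses that

$$\text{(a)} \quad \bar t_a(2) \xrightarrow{d} \begin{cases} \overline{\mathcal{DF}}(W_1, W_2) , & \text{if } \mathrm{E}(\Delta x_t) = 0 \\ \widetilde{\mathcal{DF}} , & \text{if } \mathrm{E}(\Delta x_t) \neq 0 \end{cases}$$

$$\text{(b)} \quad \bar t_\gamma(2) \xrightarrow{d} \begin{cases} \overline{\mathcal{BDM}}(W_1, W_2) , & \text{if } \mathrm{E}(\Delta x_t) = 0 \\ \widetilde{\mathcal{DF}} , & \text{if } \mathrm{E}(\Delta x_t) \neq 0 \end{cases}$$

$$\text{(c)} \quad \bar\eta(2) \xrightarrow{d} \begin{cases} \overline{\mathcal{CM}}(W_1, W_2) , & \text{if } \mathrm{E}(\Delta x_t) = 0 \\ \widetilde{\mathcal{CM}} , & \text{if } \mathrm{E}(\Delta x_t) \neq 0 \end{cases}$$

as $n \to \infty$.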

Proposition 16.6 is not restricted to bivariate regressions, but carries over to the general multiple regression case as follows:

Consider single-equation regressions estimated by LS (or efficient variants thereof); regressions with intercept only on $k$ I(1) regressors, of which at least one has a drift, result under the null hypothesis (of cointegration or no cointegration, respectively) in a limit as if one runs a detrended regression on $k - 1$ I(1) regressors.

For $k = 1$ this reproduces Proposition 16.6. For a proof for the residual-based DF test with $k > 1$ see again Hansen (1992), and also the lucid discussion by Hamilton (1994, p. 596, 597); for a proof for the error-correction test see Hassler (2000a), and for the residual-based KPSS test see Hassler (2001) for $k > 1$.

In view of Proposition 16.6 one may identify two strategies when testing cointegration from regressions with intercept only; we restrict the discussion to the case of a scalar regressor $x_t$. First, one might ignore the possibility of linear trends and always work with critical values from $\overline{\mathcal{DF}}(W_1, W_2)$, $\overline{\mathcal{BDM}}(W_1, W_2)$ or $\overline{\mathcal{CM}}(W_1, W_2)$ provided for the case of regressions with intercept only under no drift; we call this strategy $S_I$ ($I$ for "ignoring"), and of course it is not correct if $\{x_t\}$ displays a linear trend in mean. Second, one may always account for the possibility of linear time trends and work with critical values from $\widetilde{\mathcal{DF}}$ or $\widetilde{\mathcal{CM}}$; let us call this strategy $S_A$ ($A$ for "account"), and note that it is only appropriate if $\{x_t\}$ is indeed dominated by a linear time trend. In a numerical example we discuss the consequences of $S_I$ and $S_A$.

Example 16.5 (Testing under the Suspicion of Time Trends) We consider tests at a nominal significance level of 5 %. Let $c_1$ and $c_2$ denote critical values from $\widetilde{\mathcal{DF}}$ or $\widetilde{\mathcal{CM}}$ and $\overline{\mathcal{DF}}(W_1, W_2)$, $\overline{\mathcal{BDM}}(W_1, W_2)$ or $\overline{\mathcal{CM}}(W_1, W_2)$, respectively. For the residual-based DF test by Phillips and Ouliaris (1990) we take asymptotic critical values from MacKinnon (1991): $c_1 = -3.41$ and $c_2 = -3.34$. Coincidentally, these critical values are not very distant. Strategy $S_I$ results in a slightly too liberal test (rejecting more often than with 5 % probability) in the presence of a drift, while $S_A$ is mildly conservative (rejecting less often than in 5 % of all cases) in the absence of a linear trend in the regressor. So, for the residual-based DF test, Proposition 16.6 is not so relevant, since the distributions happen to differ not that much, and $c_1 \approx c_2$. For the error-correction test by Banerjee et al. (1998), however, matters are not quite so harmless, since stronger size distortions are caused by a larger difference of the asymptotic critical values: $c_1 = -3.41$ and $c_2 = -3.19$. With the residual-based KPSS test, things change qualitatively and quantitatively. Critical values from Kwiatkowski, Phillips, Schmidt, and Shin (1992) and Shin (1994) are $c_1 = 0.146$ and $c_2 = 0.314$, and thus differ dramatically. Since one rejects for too large values, strategy $S_I$ implies in the presence of a linear time trend a very conservative test; it will hardly reject the null hypothesis, which comes at a price of power of course. The other way round, without linear time trends strategy $S_A$ will reject the true null hypothesis much too often, resulting in an intolerably liberal test. ■
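The decision rules behind the two strategies of Example 16.5 can be written down in a few lines of Python. The critical values are the asymptotic 5 % values quoted in the example; the dictionary layout and the function name `reject` are hypothetical choices for this sketch.

```python
# Asymptotic 5 % critical values from Example 16.5:
# c1 (detrended limit, used by strategy "A") and
# c2 (intercept-only limit without drift, used by strategy "I").
CRITICAL_VALUES = {
    "DF":   {"A": -3.41, "I": -3.34},   # residual-based Dickey-Fuller test
    "ECM":  {"A": -3.41, "I": -3.19},   # error-correction test
    "KPSS": {"A": 0.146, "I": 0.314},   # residual-based KPSS test
}

def reject(test, statistic, strategy):
    """Return True if the null hypothesis is rejected at the nominal 5 % level."""
    c = CRITICAL_VALUES[test][strategy]
    if test == "KPSS":
        return statistic > c     # KPSS rejects for too large values
    return statistic < c         # DF and ECM reject for small values

# The same statistic can lead to opposite decisions under the two strategies.
ecm_ignoring = reject("ECM", -3.30, "I")
ecm_account = reject("ECM", -3.30, "A")
kpss_ignoring = reject("KPSS", 0.20, "I")
kpss_account = reject("KPSS", 0.20, "A")
```

A statistic of $-3.30$ from the error-correction test is significant under $S_I$ but not under $S_A$, whereas a KPSS statistic of $0.20$ is significant only under $S_A$.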

Clearly, none of the strategies $S_I$ or $S_A$ is generally acceptable when testing under the possibility of linear time trends. Fortunately, one often has strong a priori beliefs regarding the absence or presence of a linear time trend in the regressor. If one is convinced that a linear time trend is present in $\{x_t\}$, then one would apply e.g. $\bar t_a(2)$ or $\bar t_\gamma(2)$ with critical values from $\widetilde{\mathcal{DF}}$; if one believes that there is no linear time trend behind $\{x_t\}$, then critical values from $\overline{\mathcal{DF}}(W_1, W_2)$ or $\overline{\mathcal{BDM}}(W_1, W_2)$ must be recommended.

If one is not sure about the absence or presence of a linear time trend in the data, then there are (at least) two more strategies beyond $S_I$ or $S_A$ one may employ. As a third strategy, one may always test from detrended data. This clearly circumvents size distortions, but comes at a price of power losses as has been acknowledged for instance by Hansen (1992) or Hamilton (1994, p. 598). Fourth, one may rely on a pretest whether the regressor follows a linear trend or not. If a linear time trend in $\{x_t\}$ is significant, then one would apply e.g. $\bar\eta(2)$ with critical values from $\widetilde{\mathcal{CM}}$; if not, then critical values from $\overline{\mathcal{CM}}(W_1, W_2)$ should be applied. Such a pretesting strategy, however, will be troubled in small samples by the problem of controlling the significance level when carrying out a sequence of conditional tests (multiple testing). A recommendation whether the strategy of generally detrending or the strategy of pretesting is to be preferred, when the presence or absence of a linear time trend is debatable, will require future research.


16.5 Problems and Solutions

Problems 

16.1  Show  the  equivalence  of  (16.1)  and  the  ARDL(2,2)  parameterization  from 
Example  16.1. 

16.2  Prove  statements  (b),  (c)  and  (d)  from  Proposition  16.2. 

16.3  Prove  statement  (e)  from  Proposition  16.2. 
16.4 Prove Corollary 16.1.

16.5  Prove  Proposition  16.3. 

16.6 Prove Proposition 16.5 for the special case that no lagged differences are required to obtain white noise errors $\{\varepsilon_t\}$:

$$\Delta y_t = \gamma\, y_{t-1} + \theta\, x_{t-1} + \varepsilon_t .$$

16.7  Prove  Proposition  16.4. 

Solutions 

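16.1 Using $\gamma$ and $b$, one obtains with $\Delta = 1 - L$ from (16.1):

$$\begin{aligned}
y_t &= y_{t-1} - (1 - a_1 - a_2)\, y_{t-1} + (c_0 + c_1 + c_2)\, x_{t-1} \\
&\quad - a_2\, (y_{t-1} - y_{t-2}) + c_0\, (x_t - x_{t-1}) - c_2\, (x_{t-1} - x_{t-2}) + \varepsilon_t \\
&= a_1 y_{t-1} + a_2 y_{t-2} + c_0 x_t + c_1 x_{t-1} + c_2 x_{t-2} + \varepsilon_t .
\end{aligned}$$

Hence, the claim is already proved.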
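The algebraic equivalence in Solution 16.1 can also be confirmed numerically: generating a series once from the ARDL(2,2) recursion and once from the error-correction form yields identical paths. The coefficient values, the simulated regressor and the variable names below are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical ARDL(2,2) coefficients.
a1, a2 = 0.5, 0.2
c0, c1, c2 = 0.3, 0.1, -0.2

rng = np.random.default_rng(3)
n = 50
x = np.cumsum(rng.standard_normal(n))   # simulated I(1) regressor
eps = rng.standard_normal(n)

y_ardl = np.zeros(n)
y_ecm = np.zeros(n)
for t in range(2, n):
    # ARDL(2,2) parameterization
    y_ardl[t] = (a1 * y_ardl[t - 1] + a2 * y_ardl[t - 2]
                 + c0 * x[t] + c1 * x[t - 1] + c2 * x[t - 2] + eps[t])
    # error-correction parameterization from the first two lines of Solution 16.1
    y_ecm[t] = (y_ecm[t - 1]
                - (1 - a1 - a2) * y_ecm[t - 1] + (c0 + c1 + c2) * x[t - 1]
                - a2 * (y_ecm[t - 1] - y_ecm[t - 2])
                + c0 * (x[t] - x[t - 1]) - c2 * (x[t - 1] - x[t - 2])
                + eps[t])

max_gap = np.max(np.abs(y_ardl - y_ecm))   # zero up to rounding error
```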

16.2 We proceed in the same way as in the proof of Proposition 15.5, only that we work under (16.6) when appealing to Proposition 14.4. We start with

$$n\,(1 - R^2) = \frac{n^{-1} \sum_{t=1}^{n} \hat v_t^2}{n^{-2} \sum_{t=1}^{n} y_t^2} .$$

The numerator on the right-hand side is just

$$\begin{aligned}
s^2 &= n^{-1} \sum_{t=1}^{n} (y_t - \hat b\, x_t)^2 \\
&= n^{-1} \sum_{t=1}^{n} (b\, x_t - \hat b\, x_t + v_t)^2 \\
&= n^{-1} \sum_{t=1}^{n} \left[ (b - \hat b)^2 x_t^2 + 2\,(b - \hat b)\, x_t v_t + v_t^2 \right] \\
&= n\,(b - \hat b)^2\, \frac{\sum_{t=1}^{n} x_t^2}{n^2} + 2\,(b - \hat b)\, \frac{\sum_{t=1}^{n} x_t v_t}{n} + \frac{\sum_{t=1}^{n} v_t^2}{n} .
\end{aligned}$$

The first of the three remaining terms tends to zero as $\sum x_t^2$ is of order $n^2$ and $(\hat b - b)$ is of order $n^{-1}$; correspondingly, the second expression tends to zero as $\sum x_t v_t$ grows with $n$; finally, the third term converges to $\mathrm{Var}(v_t)$ as a law of large numbers holds for $\{v_t^2\}$. This proves Proposition 16.2(c). By the same arguments, one establishes:

$$n^{-2} \sum_{t=1}^{n} y_t^2 = n^{-2} \sum_{t=1}^{n} \left( b^2 x_t^2 + 2\, b\, x_t v_t + v_t^2 \right) \xrightarrow{d} b^2 \int_0^1 B_2^2(s)\, ds + 0 + 0 .$$

Hence, Proposition 16.2(b) is proved as well.

Finally, the behavior of the t-statistic with $s_b^2 = s^2 / \sum_{t=1}^{n} x_t^2$ is clear again by Proposition 14.4:

$$t_b = \frac{\hat b - b}{s_b} = \frac{n^{-1} \sum_{t=1}^{n} x_t v_t}{s\, \sqrt{n^{-2} \sum_{t=1}^{n} x_t^2}} \xrightarrow{d} \frac{\int_0^1 B_2(s)\, dB_1(s) + \sum_{h=0}^{\infty} \mathrm{E}(\Delta x_t\, v_{t+h})}{\sqrt{\gamma_1(0) \int_0^1 B_2^2(s)\, ds}} .$$


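16.3 In order to analyze the behavior of the Durbin-Watson statistic, we only need to study the numerator,

$$\begin{aligned}
n^{-1} \sum_{t=2}^{n} (\hat v_t - \hat v_{t-1})^2 &= n^{-1} \sum_{t=2}^{n} \left( (b - \hat b)\, \Delta x_t + \Delta v_t \right)^2 \\
&= (b - \hat b)^2\, \frac{\sum_{t=2}^{n} (\Delta x_t)^2}{n} + 2\,(b - \hat b)\, \frac{\sum_{t=2}^{n} \Delta x_t\, \Delta v_t}{n} + \frac{\sum_{t=2}^{n} (\Delta v_t)^2}{n} .
\end{aligned}$$

As $(b - \hat b)$ tends to zero, there remains asymptotically

$$n^{-1} \sum_{t=2}^{n} (\Delta v_t)^2 \to \mathrm{Var}(\Delta v_t) = 2\, \mathrm{Var}(v_t) - 2\, \mathrm{Cov}(v_t, v_{t-1}) .$$

Hence, it holds as claimed:

$$dw = \frac{n^{-1} \sum_{t=2}^{n} (\Delta \hat v_t)^2}{s^2} \to 2\,(1 - \rho_v(1)) ,$$

as $s^2$ approaches $\mathrm{Var}(v_t)$ with $n$ growing.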
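The limit of the Durbin-Watson statistic can be illustrated by simulation. In the sketch below the equilibrium errors follow an AR(1) process with first-order autocorrelation $\rho_v(1) = 0.5$, so the statistic should be close to $2(1 - 0.5) = 1$. The AR(1) design, the sample size and the value $b = 2$ are assumptions for this example and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n, b, rho = 5000, 2.0, 0.5                 # assumed design
x = np.cumsum(rng.standard_normal(n))      # I(1) regressor
shocks = rng.standard_normal(n)
v = np.zeros(n)
for t in range(1, n):
    v[t] = rho * v[t - 1] + shocks[t]      # stationary AR(1) equilibrium errors
y = b * x + v                              # cointegrated pair

b_hat = (x @ y) / (x @ x)                  # LS without intercept
v_hat = y - b_hat * x                      # residuals
dw = np.sum(np.diff(v_hat) ** 2) / np.sum(v_hat ** 2)
dw_limit = 2 * (1 - rho)
```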

16.4 The result will follow from Proposition 16.2. By (16.7), we have $\omega_{12} = 0$. Due to the resulting diagonality of $\Omega$, $\Omega^{0.5}$ is diagonal as well, cf. Example 14.6. Hence it holds:

$$\begin{pmatrix} B_1(s) \\ B_2(s) \end{pmatrix} = \begin{pmatrix} \omega_1 W_1(s) \\ \omega_2 W_2(s) \end{pmatrix} ,$$

and $\omega_2$ cancels from the limiting distribution of Proposition 16.2(d) (using the second assumption $\Delta_{xv} = 0$):

$$\frac{\omega_1 \omega_2 \int_0^1 W_2(s)\, dW_1(s)}{\sqrt{\gamma_1(0)\, \omega_2^2 \int_0^1 W_2^2(s)\, ds}} = \frac{\omega_1}{\sqrt{\gamma_1(0)}}\; \frac{\int_0^1 W_2(s)\, dW_1(s)}{\sqrt{\int_0^1 W_2^2(s)\, ds}} .$$

According to Proposition 10.4, the stochastic quotient on the right-hand side follows a standard normal distribution. Hence, it holds that

$$t_b \xrightarrow{d} \frac{\omega_1}{\sqrt{\gamma_1(0)}}\; \mathcal{N}(0, 1) ,$$

which proves the corollary.
16.5 By assumption, it holds (neglecting the starting value $x_0$):

$$n^{-3} \sum_{t=1}^{n} x_t^2 = n^{-3} \left[ \mu^2 \sum_{t=1}^{n} t^2 + 2 \mu \sum_{t=1}^{n} t \sum_{j=1}^{t} e_j + \sum_{t=1}^{n} \Big( \sum_{j=1}^{t} e_j \Big)^2 \right] .$$

We know from Proposition 14.2(c) and (e) that the second and the third expression in square brackets have to be divided by $n^{2.5}$ and $n^2$, respectively, such that they converge. However, in front of the square bracket there is $n^{-3}$, such that it holds with (15.2):

$$n^{-3} \sum_{t=1}^{n} x_t^2 \to \frac{\mu^2}{3} .$$

Hence we have the denominator of the LS estimator under control:

$$\hat b = b + \frac{\sum_{t=1}^{n} x_t v_t}{\sum_{t=1}^{n} x_t^2} .$$

In order to crack the numerator, we consider

$$\frac{x_{\lfloor sn \rfloor}}{n} = \frac{x_0}{n} + \frac{\mu \lfloor sn \rfloor}{n} + \frac{\sum_{j=1}^{\lfloor sn \rfloor} e_j}{n} ,$$

and due to Proposition 14.1 it holds

$$\frac{x_{\lfloor sn \rfloor}}{n} \approx \frac{\mu \lfloor sn \rfloor}{n} \to \mu\, s ,$$

and $x_t$ is dominated by a linear trend, i.e. $x_t$ behaves just as $\mu t$. Thus, as for the detrending in the trend stationary case, we obtain with a standard Wiener process $V$ that (see Sect. 15.2):

$$n^{1.5} (\hat b - b) = \frac{n^{-1.5} \sum_{t=1}^{n} x_t v_t}{n^{-3} \sum_{t=1}^{n} x_t^2} \xrightarrow{d} \frac{\mu\, \omega_1 \int_0^1 s\, dV(s)}{\mu^2 / 3} = \frac{3\, \omega_1}{\mu} \int_0^1 s\, dV(s) ,$$

where $\omega_1 V(s)$ is the Brownian motion corresponding to $\sum_{j=1}^{\lfloor sn \rfloor} v_j$, and $\omega_1^2$ the long-run variance of $\{v_t\}$. The normality of the integral follows from Example 9.2. Hence, the claim is proved.
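As a short worked step, the variance of the Gaussian limit in Solution 16.5 can be computed from the isometry for integrals with deterministic integrands. This is offered as a plausibility check under the notation above, not as a restatement of Proposition 16.3:

```latex
\mathrm{Var}\!\left( \int_0^1 s \, dV(s) \right) = \int_0^1 s^2 \, ds = \frac{1}{3},
\qquad\text{so that}\qquad
\frac{3\,\omega_1}{\mu} \int_0^1 s \, dV(s) \sim
\mathcal{N}\!\left( 0, \; \frac{9\,\omega_1^2}{\mu^2} \cdot \frac{1}{3} \right)
= \mathcal{N}\!\left( 0, \; \frac{3\,\omega_1^2}{\mu^2} \right).
```

The variance shrinks as the drift $\mu$ grows in absolute value, in line with the trend dominating the stochastic component of the regressor.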

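16.6 Because of the simplifying assumption we consider the regression of (16.14) without differences. As LS estimator for the vector $(\gamma, \theta)'$ one hence obtains for a sample $t = 1, \ldots, n$:

$$\begin{pmatrix} \hat\gamma \\ \hat\theta \end{pmatrix} = D^{-1} \sum_{t=1}^{n} \begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} \Delta y_t , \qquad (16.18)$$

where $D$ has the form

$$D = \begin{pmatrix} \sum_{t=1}^{n} y_{t-1}^2 & \sum_{t=1}^{n} y_{t-1} x_{t-1} \\ \sum_{t=1}^{n} x_{t-1} y_{t-1} & \sum_{t=1}^{n} x_{t-1}^2 \end{pmatrix} .$$

Plugging in $\Delta y_t = \varepsilon_t$ under $H_0$, we obtain

$$\begin{pmatrix} \hat\gamma \\ \hat\theta \end{pmatrix} = D^{-1} \sum_{t=1}^{n} \begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} \varepsilon_t .$$

In the case of no cointegration, for using Proposition 14.4 we choose

$$z_t = \begin{pmatrix} y_t \\ x_t \end{pmatrix} .$$

Then it holds for the matrix $D$:

$$n^{-2} D \xrightarrow{d} \begin{pmatrix} \int_0^1 B_1^2(s)\, ds & \int_0^1 B_1(s) B_2(s)\, ds \\ \int_0^1 B_1(s) B_2(s)\, ds & \int_0^1 B_2^2(s)\, ds \end{pmatrix} .$$

For the inverse, this implies

$$n^{2} D^{-1} \xrightarrow{d} \frac{1}{\det} \begin{pmatrix} \int_0^1 B_2^2(s)\, ds & -\int_0^1 B_1(s) B_2(s)\, ds \\ -\int_0^1 B_1(s) B_2(s)\, ds & \int_0^1 B_1^2(s)\, ds \end{pmatrix} , \qquad (16.19)$$

where "det" stands for the determinant of the limiting matrix of $n^{-2} D$ whose inverse appears in (16.19). Here, the known inversion formula for $(2 \times 2)$-matrices was applied:

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{\det} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix} , \quad \det = a\, d - b\, c . \qquad (16.20)$$

Note that Proposition 14.4(b) was applied with $z_{t-1}$ instead of $z_t$. This is unproblematic and can be justified by arguments similar to those leading to Corollary 14.1 in the univariate case.

Next, we analyze with Proposition 14.4(c)

$$\begin{aligned}
n^{-1} \sum_{t=1}^{n} \begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} \Delta y_t &= n^{-1} \sum_{t=1}^{n} (z_t - w_t)\, w_{1,t} \\
&\xrightarrow{d} \int_0^1 B(s)\, dB_1(s) + \sum_{h=0}^{\infty} \mathrm{E}(w_t\, w_{1,t+h}) - \mathrm{E}(w_t\, w_{1,t}) \\
&= \int_0^1 B(s)\, dB_1(s) + \sum_{h=1}^{\infty} \mathrm{E}(w_t\, w_{1,t+h}) \\
&= \int_0^1 B(s)\, dB_1(s) + 0 ,
\end{aligned}$$

where we used that $w_{1,t} = \Delta y_t = \varepsilon_t$ is free from serial correlation and is uncorrelated with $\Delta x_s$ at each point in time, see (16.15):

$$\mathrm{E}(w_t\, w_{1,t+h}) = 0 , \quad h > 0 .$$

Thus, under the null hypothesis of no cointegration, we obtain that $\hat\gamma$ tends to zero. For this purpose we consider the first row of the limit of $n^2 D^{-1}$ multiplied by $\int_0^1 B(s)\, dB_1(s)$:

$$n\, \hat\gamma \xrightarrow{d} \frac{\int_0^1 B_2^2(s)\, ds \int_0^1 B_1(s)\, dB_1(s) - \int_0^1 B_1(s) B_2(s)\, ds \int_0^1 B_2(s)\, dB_1(s)}{\det} .$$

This is almost the claim as "det" is defined as the determinant of the limit of $n^{-2} D$. Finally, note that $B_i = \omega_i W_i$ holds as $\Delta x_t = w_{2,t}$ and $\Delta y_s = \varepsilon_s = w_{1,s}$ are uncorrelated. Thus, the long-run variances cancel from the limiting distribution and one obtains the required result.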

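16.7 Under the null hypothesis of no cointegration we define $z_t' = (y_t, x_t)$ with long-run variance matrix of full rank. The corresponding vector Brownian motion $B' = (B_1, B_2)$ can be written in terms of independent WPs $W' = (W_1, W_2)$ as

$$B(t) = T\, W(t) = \begin{pmatrix} t_{11} W_1(t) + \frac{\omega_{12}}{\omega_2} W_2(t) \\ \omega_2 W_2(t) \end{pmatrix} .$$

Here, $T$ is the triangular decomposition given in (14.7) with $T T' = \Omega$. The limit of $\hat\beta$ from Proposition 15.5(a) hence becomes

$$\omega_2\, \beta_\infty = \frac{\int_0^1 B_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} = \frac{t_{11} \int_0^1 W_1(s) W_2(s)\, ds + \frac{\omega_{12}}{\omega_2} \int_0^1 W_2^2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} . \qquad (16.21)$$

With this result one obtains a FCLT for the residuals $\hat u_t = y_t - \hat\beta x_t$:

$$\begin{aligned}
n^{-0.5}\, \hat u_{\lfloor rn \rfloor} &\Rightarrow B_1(r) - \beta_\infty B_2(r) \\
&= (1, -\beta_\infty)\, T\, W(r) \\
&= t_{11} \left[ W_1(r) - \frac{\int_0^1 W_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds}\, W_2(r) \right] \\
&= t_{11}\, U(r) .
\end{aligned}$$

Unfortunately, however, Proposition 14.2 does not apply directly since $U(r)$ is not a WP. Still, with the techniques we used to prove Proposition 14.2(e) and (f) in Problems 14.3 and 14.5, we can establish (omitting details)

$$n^{-2} \sum_{t=1}^{n} \hat u_{t-1}^2 \xrightarrow{d} t_{11}^2 \int_0^1 U^2(t)\, dt ,$$

$$n^{-1} \sum_{t=1}^{n} \hat u_{t-1}\, \Delta \hat u_t \xrightarrow{d} \frac{t_{11}^2\, U^2(1)}{2} - \frac{\omega_1^2 - 2 \beta_\infty \omega_{12} + \beta_\infty^2 \omega_2^2}{2} ,$$

where the last limit arises because $\{\Delta z_t\}$ is white noise such that $\omega_i^2$ and $\omega_{12}$ coincide with the (co)variances. Further, note by $B = T W$ that

$$\omega_{12}\, \beta_\infty = \frac{t_{11}\, \omega_{12} \int_0^1 W_1(s) W_2(s)\, ds}{\omega_2 \int_0^1 W_2^2(s)\, ds} + \frac{\omega_{12}^2}{\omega_2^2} . \qquad (16.22)$$

Remember from (14.7) that

$$t_{11}^2 = \omega_1^2 - \frac{\omega_{12}^2}{\omega_2^2} .$$

Consequently, for

$$\hat a - 1 = \frac{\sum_{t=1}^{n} \hat u_{t-1}\, \Delta \hat u_t}{\sum_{t=1}^{n} \hat u_{t-1}^2}$$

we get by (16.21) and (16.22) that

$$n\, (\hat a - 1) \xrightarrow{d} \frac{U^2(1) - \left[ 1 + \left( \frac{\int_0^1 W_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} \right)^2 \right]}{2 \int_0^1 U^2(t)\, dt} .$$

The numerator of this limit may be condensed. Use the product rule from Example 11.5 to obtain

$$W_1(1)\, W_2(1) = \int_0^1 W_1(s)\, dW_2(s) + \int_0^1 W_2(s)\, dW_1(s) .$$

It follows that

$$U^2(1) = W_1^2(1) - 2\, \frac{\int_0^1 W_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} \left[ \int_0^1 W_1(s)\, dW_2(s) + \int_0^1 W_2(s)\, dW_1(s) \right] + \left( \frac{\int_0^1 W_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} \right)^2 W_2^2(1) .$$

Once more by Ito's lemma, $\frac{W_i^2(1)}{2} = \int_0^1 W_i(s)\, dW_i(s) + \frac{1}{2}$, such that

$$\begin{aligned}
\frac{U^2(1)}{2} &= \int_0^1 W_1(s)\, dW_1(s) + \frac{1}{2} - \frac{\int_0^1 W_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} \left[ \int_0^1 W_1(s)\, dW_2(s) + \int_0^1 W_2(s)\, dW_1(s) \right] \\
&\quad + \left( \frac{\int_0^1 W_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} \right)^2 \left[ \int_0^1 W_2(s)\, dW_2(s) + \frac{1}{2} \right] \\
&= \int_0^1 U(t)\, dU(t) + \frac{1}{2} \left[ 1 + \left( \frac{\int_0^1 W_1(s) W_2(s)\, ds}{\int_0^1 W_2^2(s)\, ds} \right)^2 \right] .
\end{aligned}$$

This provides the expression for the limiting distribution given in Proposition 16.4 as required.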
Algebra, 14
σ-, 14
Borel-,  16 
ARCH  model,  130 
EGARCH,  140 
GARCH,  135 
GARCH-M,  139 
IGARCH,  137 
Autocorrelation,  31 
Autocovariance,  31 

Autoregressive  distributed  lag  model,  354 


Brownian  bridge,  163,  339 
Brownian  motion,  156 
with  drift,  162 
geometric,  165,  268 


Cadlag,  311 
Causally  invertible,  54 
Cholesky  decomposition,  319 
Coefficient  of  determination,  342 
Cointegration,  341,  355 
Comparison  of  coefficients,  53,  71 
Continuous  differentiability,  200 
Convergence,  155 

Cauchy  criterion,  188 
in  distribution,  155,  190,  314 
in  mean  square,  181,  186 
in  probability,  189 
weak,  308,  313 
Correlation  coefficient,  24 
Covariance,  23 
Cycle,  78 
annual,  83 
cosine,  78 
semi-annual,  83 
sine,  224 


Density  function,  17 
Detrended  regression,  371 
Detrending,  331,  371,  373 
Dickey-Fuller  test,  336 
Difference  equation,  56,  60 
deterministic,  60 
stochastic,  56 
Differential  equation 

with  constant  coefficients,  265 
deterministic,  264 
homogeneous,  264,  266 
stochastic (see Stochastic differential equation)

Diffusion,  221,  243,  261 
Distribution,  22 
conditional,  27 
joint,  22 
marginal,  22 
multivariate,  30 
Distribution  function,  16 
Drift.  See  Integrated  process 
Durbin-Watson statistic, 342


Error-correction  model,  353,  355 

Event,  13 

Expectation 

conditional,  27 
Expected  value,  18,  267 


Filter, 51, 80, 85
causal,  51 
difference,  52 
Fractional 

differences,  106 
integration,  107 
noise,  108 
Frequency,  78 
Functional, 312

Functional  central  limit  theorem,  307 


Gamma  function,  107,  119,  121,  175 
Gaussian  distribution 
asymptotic,  361 
Gaussian  process,  30 


Impulse  response,  50,  87,  104 

Index  set,  29 

Inequality 

Cauchy-Schwarz,  25 
Chebyshev’s,  20 
Jensen’s,  26 
Markov’s,  20 
triangle,  25,  312 
Information  set,  32,  128 
Integrated  process,  305,  306,  317 
with  drift,  334,  364 
of order -1, I(-1), 306
of order 0, I(0), 306
of order 1, I(1), 306
Invariance  principle,  308 
Ito  integral,  215 

autocovariance,  217 
expected  value,  216,  217 
general,  219 
variance,  216,  218 
Ito’s  lemma 

bivariate  with  one  factor,  245 
for  diffusions,  243 
for  Wiener  processes,  240 
multivariate,  250 

with  time  as  a  dependent  variable,  248 
Ito  sum,  214 


KPSS  test,  338 
Kurtosis,  19 


Lag operator, 51
Lag  polynomial,  53 
causally  invertible,  54 
invertible,  54 

Least  squares  estimator,  4,  331 
Leverage  effect,  141 
L'Hospital's rule, 205
Linear  time  trend,  331 
Long  memory,  104,  110 


Long-run  variance,  303 

consistent  estimation,  335 
matrix,  317 

Markov  property,  59 
Martingale,  32 

difference,  33,  129 
Mean  squared  error,  187 
Measurability,  15 
Metric,  311 

supremum,  312 
Moments,  18 
centered,  18 


Normal  distribution,  21 
bivariate,  24 
log-,  165 

Ornstein-Uhlenbeck process, 204, 248, 285
properties,  205 

Partial  integration,  200 
Partition,  153,  180 
adequate,  180 
disjoint,  153 
equidistant,  153,  180 
Period,  78 

Persistence,  50,  59,  86,  104 
anti-,  111 
strong,  108 

Power  transfer  function,  80,  85 
Probability,  13,  14 
space,  14 
Process 

ARCH  (see  ARCH  model) 

ARMA,  64 
autoregressive,  56 
continuous-time,  30 
discrete-time,  30 

integrated  (see  Integrated  process) 

invertible,  65,  88,  109 

linear,  49 

Markov,  32 

moving  average,  45 

normal,  30 

pure random, 31

stationary,  30 

stochastic,  29 

strictly  stationary,  30 

weakly stationary, 31
Random variable, 15
continuous,  17 
integrable,  26 
Random  walk,  151 

continuous-valued,  153 
discrete-valued, 152
Residuals,  334,  335 
Riemann  integral,  181 
autocovariance,  185 
expected  value,  182 
Gaussian  distribution,  183 
variance,  184 

Riemann-Stieltjes  sum,  200 
Riemann  sum,  180 


Stratonovich  integral,  217 
Superconsistency,  358 


Theorem 

Donsker’s,  308 
Fubini’s,  22,  182 
Slutsky’s,  315 
Time  series,  29 
Trend component, 81
Trend  stationary,  332,  333 


Unit  root,  306 


Schur  criterion,  56 
Set  of  outcomes,  13 
Skewness,  19 

Spectral density function, 81
Spectrum, 80, 81
Stieltjes  integral,  200,  220 
autocovariance,  204 
definition,  199 
expected  value,  202 
Gaussian  distribution,  202 
variance,  202 

Stochastically  independent,  23 
Stochastic  differential  equation 
with  constant  coefficients,  268 
with  linear  coefficients,  263 
inhomogeneous  linear  with  additive  noise, 
268 

moments  of  the  solution,  267 
uniqueness  of  solution,  262 


Variance,  18 
Variation,  222 
absolute,  222 
quadratic,  225 
Volatility,  127 


White  noise,  31,  81 
Wiener  process,  156 
demeaned,  309 
hitting  time,  160 
integrated,  185 
maximum,  167 
reflected,  164 
scale  invariance,  159 
zero  crossing,  161 
W.l.o.g.,  60 

Wold decomposition, 51


